Consistency of empirical Bayes and kernel flow for hierarchical parameter estimation

Chen, Yifan; Owhadi, Houman; Stuart, Andrew

doi:10.1090/mcom/3649


by Yifan Chen, Houman Owhadi and Andrew M. Stuart

Math. Comp. 90 (2021), 2527–2578

Abstract:

Gaussian process regression has proven very powerful in statistics, machine learning and inverse problems. A crucial aspect of the success of this methodology, in a wide range of applications to complex and real-world problems, is hierarchical modeling and learning of hyperparameters. The purpose of this paper is to study two paradigms of learning hierarchical parameters: one is from the probabilistic Bayesian perspective, in particular the empirical Bayes approach that has been widely used in Bayesian statistics; the other is from the deterministic and approximation-theoretic view, in particular the kernel flow algorithm that was proposed recently in the machine learning literature. For a Matérn-like model on the torus, this paper analyzes the consistency of both approaches in the large data limit and explicitly identifies their implicit bias in parameter learning. A particular technical challenge we overcome is the learning of the regularity parameter in the Matérn-like field, for which consistency results have been very scarce in the spatial statistics literature. Moreover, we conduct extensive numerical experiments beyond the Matérn-like model, comparing the two algorithms further. These experiments demonstrate learning of other hierarchical parameters, such as amplitude and lengthscale; they also illustrate the setting of model misspecification, in which the kernel flow approach could show superior performance to the more traditional empirical Bayes approach.
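To make the two paradigms concrete, the following is a minimal sketch, not the authors' code, of how a regularity parameter s might be estimated by each approach. It assumes a one-dimensional torus [0, 1), a covariance of the form (-Laplacian)^(-s) truncated to finitely many Fourier modes, an empirical Bayes objective of the common form y^T K^{-1} y + log det K, and a kernel flow objective of the common form 1 - (y_c^T K_c^{-1} y_c) / (y^T K^{-1} y), where the subscript c denotes a subsample of the data. The function names, the grid search, the every-other-point subsampling, the mode truncation and the jitter are all illustrative assumptions.

```python
import numpy as np


def matern_like_kernel(x, s, n_modes=128, jitter=1e-12):
    """Truncated Fourier series of (-Laplacian)^(-s) on the 1D torus [0, 1).

    The truncation level and the jitter are assumptions for this sketch.
    """
    m = np.arange(1, n_modes + 1)
    weights = (2.0 * np.pi * m) ** (-2.0 * s)
    diff = x[:, None] - x[None, :]
    K = 2.0 * np.cos(2.0 * np.pi * diff[..., None] * m) @ weights
    return K + jitter * np.eye(len(x))


def sample_field(x, s, rng, n_modes=128):
    """Draw one mean-zero Gaussian field with the covariance above (no jitter)."""
    m = np.arange(1, n_modes + 1)
    amp = np.sqrt(2.0) * (2.0 * np.pi * m) ** (-s)
    phase = 2.0 * np.pi * x[:, None] * m
    a = rng.standard_normal(n_modes)
    b = rng.standard_normal(n_modes)
    return np.cos(phase) @ (amp * a) + np.sin(phase) @ (amp * b)


def eb_loss(y, K):
    """Empirical Bayes objective: negative log marginal likelihood,
    up to an additive constant and a factor of two."""
    _, logdet = np.linalg.slogdet(K)
    return y @ np.linalg.solve(K, y) + logdet


def kf_loss(y, K):
    """Kernel flow objective with a deterministic every-other-point subsample."""
    idx = np.arange(0, len(y), 2)
    full = y @ np.linalg.solve(K, y)
    half = y[idx] @ np.linalg.solve(K[np.ix_(idx, idx)], y[idx])
    return 1.0 - half / full


rng = np.random.default_rng(0)
x = np.arange(32) / 32.0              # equispaced observation points
s_true = 2.0                          # regularity used to generate the data
y = sample_field(x, s_true, rng)

# Grid search over candidate regularities (a stand-in for an optimizer).
s_grid = np.linspace(1.0, 3.0, 9)
eb = [eb_loss(y, matern_like_kernel(x, s)) for s in s_grid]
kf = [kf_loss(y, matern_like_kernel(x, s)) for s in s_grid]
s_eb = s_grid[int(np.argmin(eb))]
s_kf = s_grid[int(np.argmin(kf))]
print("empirical Bayes estimate of s:", s_eb)
print("kernel flow estimate of s:", s_kf)
```

The two minimizers need not coincide: each objective rewards a different property of the kernel, which is one way to read the implicit bias discussed in the abstract. A single draw on a coarse grid is only illustrative and says nothing about the large data limit.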
