Communicated by Notices Associate Editor Richard Levine
Grace Wahba (née Goldsmith, born August 3, 1934) joined the Department of Statistics at the University of Wisconsin–Madison in 1967, and remained on its faculty until her retirement in 2019. During her luminous career of more than half a century at Madison, Grace graduated 39 doctoral students and has over 400 academic descendants to date. Over the years, Grace has worked on countless interesting problems, covering a broad spectrum from theory to applications. Above all, Grace is best known as the mother of the Wisconsin spline school, the primary driving force behind a rich family of data smoothing methods developed by herself, her collaborators, and generations of students. Some of Grace’s life stories can be found in a recent entertaining piece by Nychka, Ma, and Bates [10]. Here, we try to complement that piece with some technical discussions concerning the Wisconsin spline school.
As Grace’s former students, we had as our first set of reading assignments the defining document of reproducing kernel Hilbert spaces by Aronszajn [1] and three early papers by Kimeldorf and Wahba [4, 5, 6]. As we shall demonstrate below, some of those early results had far-reaching impacts on developments decades later. According to [10], many of the ideas of Kimeldorf and Wahba were inspired by discussions during tea parties at the Mathematics Research Center at UW–Madison in the late 1960s/early 1970s, with participants including Isaac Schoenberg, Carl de Boor, and Larry Schumaker.
In the sections to follow, we shall outline the smoothing spline approach to data smoothing, addressing numerous aspects and noting similarities and differences compared to related techniques. We highlight Grace’s many original contributions, but otherwise focus on the flow of presentation; for more accurate attributions of credit, one may consult the foreword of [16] and the bibliographic notes in [3].
1. Smoothing Splines
Given pairs $(x_{i},y_{i})$, $i=1,\dots,n$, $x_{i}\in [a,b]$, one may obtain a smoothing spline via the minimization of
$$\frac{1}{n}\sum _{i=1}^{n}\big (y_{i}-\eta (x_{i})\big )^{2}+\lambda \int _{a}^{b}\big (\eta ''(x)\big )^{2}dx.\tag{1}$$
The minimizer of (1) is a piecewise cubic polynomial, twice continuously differentiable, with the third derivative jumping at the distinct $x_{i}$ points.
In the mathematics and numerical analysis literature, a spline typically refers to a piecewise polynomial; there, one is concerned with function interpolation or approximation, and the $(x_{i},y_{i})$ pairs are exact samples satisfying $y_{i}=\eta (x_{i})$.
With stochastic data, one does not have exact samples of the function and needs statistical models. A regression model behind (1) has $y_{i}=\eta (x_{i})+\epsilon _{i}$, $\epsilon _{i}\sim N(0,\sigma ^{2})$ independent, and the minimizer of (1) provides an estimate of the regression function $\eta (x)$.
The first term in (1) pushes for a close fit of $\eta$ to the data and the second term penalizes the roughness of $\eta$, with the smoothing parameter $\lambda$ controlling the tradeoff between the two conflicting goals. As $\lambda \rightarrow \infty$, one approaches a simple linear regression line $\eta (x)=\beta _{0}+\beta _{1}x$; $\lambda =0_{+}$ yields the minimum curvature interpolant.
Figure 1 illustrates the cubic smoothing spline of (1) applied to some data from a simulated motorcycle accident, found in R package MASS as data frame mcycle, where $x$ is time and $y$ is head acceleration. The data are in circles and the three lines correspond to $\log _{10}n\tilde{\lambda }=-8.5,-4.08,-1.5$, from rough to smooth (red, black, and green, respectively), with $\log _{10}n\tilde{\lambda }=-4.08$ (black) selected by cross validation via (15) using ssanova0 in R package gss; $\lambda /\tilde{\lambda }=60^{3}$ accounts for the mapping of $[0,60]$ onto $[0,1]$ in the software implementation.
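A minimal sketch of such a fit is given below, using the newer ssanova interface of gss rather than ssanova0 and leaving the smoothing parameter to the package's cross-validation default, so only the cross-validated curve is reproduced.

```r
## A rough sketch of a cubic smoothing spline fit to the mcycle data;
## lambda is selected automatically by the package.
library(MASS)  # mcycle data frame: times, accel
library(gss)

fit <- ssanova(accel ~ times, data = mcycle)        # cubic spline on the times axis
new <- data.frame(times = seq(min(mcycle$times), max(mcycle$times), length = 201))
est <- predict(fit, newdata = new, se.fit = TRUE)   # fitted eta and standard errors

plot(mcycle$times, mcycle$accel, xlab = "time", ylab = "acceleration")
lines(new$times, est$fit)
```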
Regression analysis, a primary tool of supervised learning, is widely used in applications. Traditional parametric regression was developed when data were scarce, restricting $\eta (x)$ to low-dimensional function spaces with a few free parameters; this maintains sufficient samples per parameter to effectively control “variance” in estimation, but may incur serious model bias. As the sample size $n$ increases, “variance” is less of a concern and one looks to reduce model bias, yielding numerous non-parametric regression techniques that permit “flexible” forms of $\eta (x)$. The smoothing spline of (1) presents a tidy approach to non-parametric regression, tuning $\lambda$ to control the effective dimension of the model space.
Many non-parametric regression methods exist, and most perform comparably well in one dimension. The real challenge lies in the modeling/estimation of functions on multi-dimensional domains, where generalizations of (1) lead to unparalleled operational convenience and a rich collection of modeling tools.
1.1. Penalized likelihood method
The penalized likelihood method results from an abstraction of (1). To estimate a function of interest $\eta$ on a generic domain $\mathcal{X}$ using stochastic data, one may minimize
$$L(\eta \,|\,\text{data})+\lambda J(\eta ),\tag{2}$$
where $L(\eta |\text{data})$ can be taken as the minus log likelihood of the data and $J(\eta )$ is a quadratic functional quantifying the roughness of $\eta$.
In a step towards an abstract $J(\eta )$, Kimeldorf and Wahba [4, 5, 6] considered $J(\eta )=\int _{a}^{b}(L\eta )^{2}(x)dx$ on $\mathcal{X}=[a,b]$ for some general differential operator $L$, with $J(\eta )=\int _{a}^{b}\big (\eta ^{(m)}(x)\big )^{2}dx$ being the special cases yielding polynomial smoothing splines. More examples of $\mathcal{X}$ and $J(\eta )$ will follow shortly.
A smoothing spline is defined as the solution to a variational problem of the form given in (2). Depending on the configuration, it may or may not reduce to a piecewise polynomial.
The first term in (1) is proportional to the minus log likelihood of the Gaussian regression model stated above. Two more examples of $L(\eta \,|\,\text{data})$ follow.
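Before turning to those examples, it may help to spell out the Gaussian case: dropping terms not involving $\eta$, the minus log likelihood of $y_{i}=\eta (x_{i})+\epsilon _{i}$, $\epsilon _{i}\sim N(0,\sigma ^{2})$ independent, is
$$-\log \prod _{i=1}^{n}\frac{1}{\sqrt{2\pi }\,\sigma }\exp \Big \{-\frac{(y_{i}-\eta (x_{i}))^{2}}{2\sigma ^{2}}\Big \}=\frac{1}{2\sigma ^{2}}\sum _{i=1}^{n}\big (y_{i}-\eta (x_{i})\big )^{2}+C,$$
which is proportional to the first term of (1), with $\sigma ^{2}$ absorbed into $\lambda$.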
Example 1 is a special case of non-Gaussian regression. A variant of Example 2 was studied by Silverman [11].
Shown in Figure 2 is an example of one-dimensional density estimation using $J(\eta )=\int _{a}^{b}\big (\eta ''(x)\big )^{2}dx$; the data are 272 waiting times between eruptions of the Old Faithful geyser, found in the R data frame faithful. Plotted is a cross-validated density estimate (see §3), obtained using ssden in R package gss, along with a histogram of the data. This is an instance of unsupervised learning.
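A minimal sketch of such a fit follows; the domain specification below is an illustrative assumption that merely needs to contain the observed range of waiting times.

```r
## A rough sketch of the Figure 2 density estimate using ssden from gss.
library(gss)

fit  <- ssden(~ waiting, data = faithful, domain = data.frame(waiting = c(40, 110)))
grid <- seq(40, 110, length = 201)
dens <- dssden(fit, grid)               # evaluate the estimated density on the grid

hist(faithful$waiting, breaks = 20, freq = FALSE,
     xlab = "waiting time (min)", main = "")
lines(grid, dens)
```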
The penalized likelihood of (2) is in fact performing constrained maximum likelihood estimation,
$$\min _{\eta }\,L(\eta \,|\,\text{data})\quad \text{subject to}\quad J(\eta )\le \rho ,$$
using the Lagrange method, with $\lambda$ being the Lagrange multiplier; see [3, Theorem 2.12], where $\lambda \propto \rho ^{-1}$ is also quantified.
The null space of $J(\eta )$, $\mathcal{N}_{J}=\{\eta :J(\eta )=0\}$, specifies a parametric model for $\eta$. With a $\rho >0$, made data-adaptive through the selection of $\lambda$, one allows $\eta$ to depart from the parametric model.
1.2. Reproducing kernel Hilbert spaces
The minimization of (2) is implicitly in the space $\{\eta :J(\eta )<\infty \}$ or a subspace therein, and function evaluations typically appear in $L(\eta \,|\,\text{data})$. To facilitate analysis and computation, one needs a metric and a geometry in the function space, and needs the evaluation functional to be continuous.
A reproducing kernel Hilbert space is a Hilbert space $\mathcal{H}$ of functions on a domain $\mathcal{X}$ in which the evaluation functional $[x]\eta =\eta (x)$ is continuous, $\forall x\in \mathcal{X}$, $\forall \eta \in \mathcal{H}$. A comprehensive theory of reproducing kernel Hilbert spaces can be found in [1].
By the Riesz representation theorem, there exists a reproducing kernel, a non-negative definite bivariate function $R(x,y)$ dual to the inner product $\langle \cdot ,\cdot \rangle$ in $\mathcal{H}$, which satisfies $\langle R(x,\cdot ),\eta (\cdot )\rangle =\eta (x)$, $\forall x\in \mathcal{X}$, $\forall \eta \in \mathcal{H}$.
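As a toy illustration (not drawn from the text), take $\mathcal{H}=\{\eta :\eta (x)=a+bx\}$ on $\mathcal{X}=\mathbb{R}$ with $\langle a_{1}+b_{1}x,\,a_{2}+b_{2}x\rangle =a_{1}a_{2}+b_{1}b_{2}$; the kernel $R(x,y)=1+xy$ then reproduces evaluation,
$$\langle R(x,\cdot ),\eta (\cdot )\rangle =\langle 1+x\,(\cdot ),\,a+b\,(\cdot )\rangle =a+bx=\eta (x).$$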
A reproducing kernel Hilbert space can also be generated from its reproducing kernel $R(x,y)$, for which any non-negative definite function qualifies, as the “column space” $\text{span}\{R(x,\cdot ),x\in \mathcal{X}\}$. The corresponding inner product may or may not have an explicit expression, however.
For use in 2, one takes $\mathcal{H}=\{\eta :J(\eta )<\infty \}$ equipped with $\langle \cdot ,\cdot \rangle =J(\cdot ,\cdot )+\tilde{J}(\cdot ,\cdot )$, where $J(\cdot ,\cdot )$ is the semi-inner product associated with the quadratic functional $J(\eta )$ and $\tilde{J}(\cdot ,\cdot )$ is an inner product in the null space $\mathcal{N}_{J}$. One has a tensor-sum decomposition $\mathcal{H}=\mathcal{N}_{J}\oplus \mathcal{H}_{J}$ with $J(\cdot ,\cdot )$ being a full inner product in $\mathcal{H}_{J}=\{\eta :J(\eta )<\infty ,\tilde{J}(\eta )=0\}$.
Facts concerning tensor-sum decompositions of reproducing kernel Hilbert spaces can be found in [3, Theorem 2.5], and technical details of Example 3 are in Craven and Wahba [2].
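Example 3 itself is not reproduced here; the sketch below assumes the standard cubic-spline construction on $[0,1]$ from [2], in which the reproducing kernels are expressed through scaled Bernoulli polynomials $k_{r}=B_{r}/r!$.

```r
## Scaled Bernoulli polynomials and the cubic-spline reproducing kernels on
## [0,1], assuming the construction of [2]: R0 reproduces the null space N_J
## and RJ generates H_J under the inner product J(.,.).
k1 <- function(x) x - 0.5
k2 <- function(x) (k1(x)^2 - 1/12) / 2
k4 <- function(x) (k1(x)^4 - k1(x)^2 / 2 + 7/240) / 24

R0 <- function(x, y) 1 + k1(x) * k1(y)                # reproducing kernel of N_J
RJ <- function(x, y) k2(x) * k2(y) - k4(abs(x - y))   # reproducing kernel of H_J
```

The kernel of $\mathcal{H}=\mathcal{N}_{J}\oplus \mathcal{H}_{J}$ is then $R_{0}+R_{J}$, as kernels of tensor-sum components add.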
The theory of reproducing kernel Hilbert spaces provides an abstract mathematical framework encompassing a great variety of problems. The abstract setting allows many important issues, such as the computation and the asymptotic convergence of the minimizers of (2), to be treated in a unified fashion. As summarized in Grace’s 1990 monograph [16], much of her work up to that date, from approximation theory to spline smoothing, fit under the general framework.
1.3. Tensor product splines
A statistical model should be interpretable, which distinguishes it from mere function approximation or some black-box predictor/classifier. Two main challenges in the non-parametric modeling of multivariate data are weak interpretability and the curse of dimensionality, both of which may be alleviated via the hierarchical structure of functional ANOVA decompositions.
Functional ANOVA decomposition
Consider a bivariate function $\eta (x)=\eta (x_{\scriptscriptstyle \langle 1\scriptscriptstyle \rangle },x_{\scriptscriptstyle \langle 2\scriptscriptstyle \rangle })$ on $\mathcal{X}=\mathcal{X}_{1}\times \mathcal{X}_{2}$; subscripts in brackets denote coordinates of a point on a multidimensional domain while ordinary subscripts are reserved for multiple points. One may write
$$\eta =\big (A_{1}+(I-A_{1})\big )\big (A_{2}+(I-A_{2})\big )\eta =A_{1}A_{2}\eta +(I-A_{1})A_{2}\eta +A_{1}(I-A_{2})\eta +(I-A_{1})(I-A_{2})\eta =\eta _{\emptyset }+\eta _{1}+\eta _{2}+\eta _{12},\tag{4}$$
where $I$ is the identity operator and $A_{1}$, $A_{2}$ are averaging operators acting on the arguments $x_{\scriptscriptstyle \langle 1\scriptscriptstyle \rangle }$, $x_{\scriptscriptstyle \langle 2\scriptscriptstyle \rangle }$, respectively, satisfying $A1=1$; $\eta _{\emptyset }$ is the constant term, $\eta _{1}$, $\eta _{2}$ are the main effects, and $\eta _{12}$ is the interaction, satisfying $A_{1}\eta _{1}=A_{1}\eta _{12}=0$ and $A_{2}\eta _{2}=A_{2}\eta _{12}=0$. Similar constructions in more than two dimensions are straightforward.
Examples of averaging operators include $A\eta =\int _{a}^{b}\eta (x)dx/(b-a)$ on $[a,b]$ and $A\eta =\sum _{i=1}^{m}\eta (x_{i})/m$ on any domain; the averaging operators on different axes may be chosen independently of each other.
For $\mathcal{X}_{1}\times \mathcal{X}_{2}$ discrete, $\eta (x_{\scriptscriptstyle \langle 1\scriptscriptstyle \rangle },x_{\scriptscriptstyle \langle 2\scriptscriptstyle \rangle })$ is a matrix of treatment means, usually denoted by $\mu _{ij}$ in a standard two-way ANOVA model, with (4) in the form
$$\mu _{ij}=\mu _{\cdot \cdot }+(\mu _{i\cdot }-\mu _{\cdot \cdot })+(\mu _{\cdot j}-\mu _{\cdot \cdot })+(\mu _{ij}-\mu _{i\cdot }-\mu _{\cdot j}+\mu _{\cdot \cdot }),$$
where $\mu _{i\cdot }=\sum _{j}c_{j}\mu _{ij}$ for $\sum _{j}c_{j}=1$, $\mu _{\cdot j}=\sum _{i}d_{i}\mu _{ij}$ for $\sum _{i}d_{i}=1$, and $\mu _{\cdot \cdot }=\sum _{i,j}c_{j}d_{i}\mu _{ij}$.
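For a concrete check, here is a small numerical sketch with a hypothetical $3\times 4$ matrix of means and equal weights, both purely illustrative.

```r
## Two-way ANOVA decomposition of a hypothetical 3 x 4 matrix of treatment
## means, with equal weights c_j = 1/4 and d_i = 1/3 in the averaging operators.
mu <- matrix(rnorm(12), 3, 4)
cw <- rep(1/4, 4)                          # column weights c_j, summing to 1
dw <- rep(1/3, 3)                          # row weights d_i, summing to 1

mu.row   <- drop(mu %*% cw)                # mu_{i.} = sum_j c_j mu_{ij}
mu.col   <- drop(dw %*% mu)                # mu_{.j} = sum_i d_i mu_{ij}
mu.grand <- sum(outer(dw, cw) * mu)        # mu_{..} = sum_{i,j} c_j d_i mu_{ij}

main1 <- mu.row - mu.grand                 # main effect of the row factor
main2 <- mu.col - mu.grand                 # main effect of the column factor
inter <- mu - outer(mu.row, rep(1, 4)) - outer(rep(1, 3), mu.col) + mu.grand

## sanity check: the four pieces add back to mu (up to rounding)
range(mu - (mu.grand + outer(main1, rep(1, 4)) + outer(rep(1, 3), main2) + inter))
```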
Selective term elimination in functional ANOVA decompositions helps to combat the curse of dimensionality in estimation and facilitates the interpretation of the analysis. For example, the so-called additive models, those containing only main effects, are easier to estimate and interpret than ones involving interactions. As with classical ANOVA models on discrete domains, the inclusion of higher-order interactions is to be avoided in practice.
For $\eta$ a log density, the absence of selected interactions may imply (conditional) independence structures among random variables. For random variables $(X,Y,Z)$ on a domain $\mathcal{X}\times \mathcal{Y}\times \mathcal{Z}$, say, a log density of the form $\eta =\eta _{\emptyset }+\eta _{x}+\eta _{y}+\eta _{z}+\eta _{xy}+\eta _{xz}$ implies the conditional independence of $Y$ and $Z$ given $X$, or $Y\bot Z|X$, where the notation for ANOVA terms of $\eta (x,y,z)$ parallels that in (4).
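To see why, group the terms as
$$\eta (x,y,z)=\{\eta _{\emptyset }+\eta _{x}+\eta _{y}+\eta _{xy}\}+\{\eta _{z}+\eta _{xz}\},$$
so the density $e^{\eta }$ factors into a function of $(x,y)$ times a function of $(x,z)$; conditioning on $X=x$, the conditional density of $(Y,Z)$ factors accordingly, which is precisely $Y\bot Z|X$.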
Tensor product spaces
For the estimation of $\eta$ on product domains via 2, functional ANOVA decompositions can be built in through the construction of tensor product reproducing kernel Hilbert spaces.
To construct a reproducing kernel Hilbert space, it suffices to specify a reproducing kernel. The following example illustrates a construction on $\mathcal{X}=[0,1]^{2}$ using the results of Example 3.
The construction described in Example 4 is at an abstract level, requiring few specifics of cubic splines on $[0,1]$. All one needs are reproducing kernels on the marginal domains. The functional ANOVA decomposition follows trivially from one-way ANOVA decompositions on the marginal domains.
Denote by $\mathcal{H}_{\beta }$ the component spaces of $\mathcal{H}_{J}$ in Example 4 and write $R_{\beta }$, $J_{\beta }$ for the respective reproducing kernels and the associated square norms. With $R_{J}=\sum _{\beta }\theta _{\beta }R_{\beta }$, the corresponding square norm in $\mathcal{H}_{J}$ is given by $J(\eta )=\sum _{\beta }\theta _{\beta }^{-1}J_{\beta }(\eta )$, to be used in (2).
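As a sketch of what $R_{J}=\sum _{\beta }\theta _{\beta }R_{\beta }$ may look like for a cubic-spline-by-cubic-spline construction on $[0,1]^{2}$, assuming the marginal kernels of the earlier Bernoulli-polynomial snippet and a null space spanned by the constant, the two linear terms, and the linear-linear interaction:

```r
## Tensor-product reproducing kernel R_J = sum_beta theta_beta R_beta on
## [0,1]^2, built from cubic-spline marginal kernels; theta defaults to 1.
k1 <- function(x) x - 0.5
k2 <- function(x) (k1(x)^2 - 1/12) / 2
k4 <- function(x) (k1(x)^4 - k1(x)^2 / 2 + 7/240) / 24
R.lin <- function(x, y) k1(x) * k1(y)                    # linear-contrast kernel
R.smt <- function(x, y) k2(x) * k2(y) - k4(abs(x - y))   # smooth-contrast kernel

RJ.tp <- function(x, y, theta = rep(1, 5)) {
  theta[1] * R.smt(x[1], y[1]) +                         # smooth main effect of x<1>
  theta[2] * R.smt(x[2], y[2]) +                         # smooth main effect of x<2>
  theta[3] * R.smt(x[1], y[1]) * R.lin(x[2], y[2]) +     # smooth x<1> : linear x<2>
  theta[4] * R.lin(x[1], y[1]) * R.smt(x[2], y[2]) +     # linear x<1> : smooth x<2>
  theta[5] * R.smt(x[1], y[1]) * R.smt(x[2], y[2])       # smooth-smooth interaction
}

RJ.tp(c(0.2, 0.7), c(0.5, 0.3))    # kernel value for two points in [0,1]^2
```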
As an illustration, we fit the logistic regression model of Example 1 to the wesdr data found in R package gss, concerning the progression of diabetic retinopathy. One has $x=(x_{\scriptscriptstyle \langle 1\scriptscriptstyle \rangle },x_{\scriptscriptstyle \langle 2\scriptscriptstyle \rangle },x_{\scriptscriptstyle \langle 3\scriptscriptstyle \rangle })\in {R}^{3}$ and a binary $y\sim \text{Bin}\big (1,p(x)\big )$, and the full model would include three main effects, three two-way interactions, and a three-way interaction. After some exploration, the interaction terms are found to be negligible, so an additive model appears adequate, and one of the main effects is actually linear. We finally fit a model
$$\eta (x)=\eta _{\emptyset }+\beta \,x_{\scriptscriptstyle \langle 1\scriptscriptstyle \rangle }+\eta _{2}(x_{\scriptscriptstyle \langle 2\scriptscriptstyle \rangle })+\eta _{3}(x_{\scriptscriptstyle \langle 3\scriptscriptstyle \rangle }),$$
where $\eta =\log \{p/(1-p)\}$; this is actually a partial spline model, discussed at the end of this section. Plotted in Figure 3 are the estimated $\eta _{2}$ and $\eta _{3}$ along with their Bayesian confidence intervals (see §4). The rugs on the ceiling and the floor mark the observed $y=1$ and $y=0$, respectively, though $p(x)$ does not depend on $x_{\scriptscriptstyle \langle 2\scriptscriptstyle \rangle }$ or $x_{\scriptscriptstyle \langle 3\scriptscriptstyle \rangle }$ alone; the confidence intervals do get wider where data are sparse. The analysis was done using the gssanova facilities in package gss. Further details can be found in [3, §5.5.3].
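A simplified sketch of such a fit is given below; it treats all three main effects as smooth terms, whereas the model above further forces one effect to enter linearly (the partial-spline refinement), which is omitted here. The column names are those assumed for the wesdr data frame.

```r
## A simplified sketch of a smoothing spline logistic fit to wesdr; the
## partial-spline step that makes one main effect linear is not shown.
library(gss)
data(wesdr)   # columns assumed: dur, gly, bmi, and binary ret

fit <- gssanova(ret ~ dur + gly + bmi, family = "binomial", data = wesdr)

## evaluate eta = log{p/(1-p)} across gly, other covariates held at medians,
## with Bayesian standard errors (cf. Section 4)
new <- data.frame(gly = seq(min(wesdr$gly), max(wesdr$gly), length = 101),
                  dur = median(wesdr$dur), bmi = median(wesdr$bmi))
est <- predict(fit, newdata = new, se.fit = TRUE)
```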
Grace’s signature was all over the tensor product spline technique, from the inception of the idea to the ensuing rigorous developments, involving several of her students including your authors; see, e.g., [16, Chap. 10] and [18].
1.4. More splines
Real intervals are the domains most often encountered in practice; they can be mapped onto $[0,1]$, and the cubic spline on $[0,1]$ with $J(\eta )=\int _{0}^{1}\big (\eta ''(x)\big )^{2}dx$ is the commonly used configuration for data smoothing in this setting. On domains other than real intervals, or to accommodate various special needs, alternatives to cubic splines are sometimes called for.
We now present a variety of configurations tuned to various situations, which may be used directly in (2) on the respective designated domains, or be used as building blocks to construct tensor product splines.
Periodic splines
To accommodate recurring patterns such as circadian rhythms or seasonal effects, one may consider only periodic functions on a real interval. Mapping the interval onto $[0,1]$ and setting $J(\eta )=\int _{0}^{1}\big (\eta ''(x)\big )^{2}dx$ for $\eta$ periodic, one has $\mathcal{H}=\{1\}\oplus \mathcal{H}_{J}$, with component space reproducing kernels $R_{0}(x,y)=1$ and $R_{J}(x,y)=-k_{4}(x-y)$; see Craven and Wahba [2].
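In code, assuming $k_{4}=B_{4}/4!$ is the scaled Bernoulli polynomial of the construction in [2], extended periodically, the kernel reads:

```r
## Periodic-spline reproducing kernel on [0,1]: -k4 evaluated at the
## fractional part of x - y (the periodic extension of k4 = B4/4!).
k1 <- function(x) x - 0.5
k4 <- function(x) (k1(x)^4 - k1(x)^2 / 2 + 7/240) / 24
RJ.per <- function(x, y) -k4((x - y) %% 1)
```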
L-splines
On $[0,1]$, one may configure an L-spline by setting $J(\eta )=\int _{0}^{1}(L\eta )^{2}(x)h(x)dx$, where $L$ is a general differential operator and $h(x)>0$ is a weight function. With $L=D^{m}$ and $h(x)=1$, it reduces to a polynomial spline; $m=2$ yields the cubic spline.
When the null space $\mathcal{N}_{L}=\{\eta :L\eta =0\}$ forms a more desirable parametric model than lower-order (say linear) polynomials, the corresponding L-spline is preferred over a polynomial spline.
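As a simple illustration (not from the text), taking the weight $h(x)=1$ and
$$L=D^{2}+(2\pi )^{2}I,\qquad \mathcal{N}_{L}=\{\eta :\eta ''+(2\pi )^{2}\eta =0\}=\operatorname{span}\{\cos 2\pi x,\,\sin 2\pi x\},$$
one obtains an L-spline that shrinks toward a sinusoid of unit period rather than toward a straight line, a natural choice when a known dominant cycle provides the parametric baseline.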