
Grace Wahba and the Wisconsin Spline School

Chong Gu
Yuedong Wang

Communicated by Notices Associate Editor Richard Levine


Grace Wahba (née Goldsmith, born August 3, 1934) joined the Department of Statistics at the University of Wisconsin–Madison in 1967, and remained on its faculty until her retirement in 2019. During her luminous career at Madison of more than half a century, Grace graduated 39 doctoral students and has over 400 academic descendants to date. Over the years, Grace has worked on countless interesting problems, covering a broad spectrum from theory to applications. Above all, Grace is best known as the mother of the Wisconsin spline school, the primary driving force behind a rich family of data smoothing methods developed by herself, collaborators, and generations of students. Some of Grace’s life stories can be found in a recent entertaining piece by Nychka, Ma, and Bates [10]. Here, we try to complement that piece with some technical discussions concerning the Wisconsin spline school.

As Grace’s former students, our first set of reading assignments consisted of the defining document of reproducing kernel Hilbert spaces by Aronszajn [1] and three early papers by Kimeldorf and Wahba [4, 5, 6]. As we shall demonstrate below, some of those early results had far-reaching impacts on developments decades later. According to [10], many of the ideas of Kimeldorf and Wahba were inspired by discussions during tea parties at the Mathematics Research Center at UW–Madison in the late 1960s/early 1970s, with participants including Isaac Schoenberg, Carl de Boor, and Larry Schumaker.

In the sections to follow, we shall outline the smoothing spline approach to data smoothing, addressing numerous aspects and noting similarities and differences compared to related techniques. We highlight Grace’s many original contributions, but otherwise focus on the flow of presentation; for more accurate attributions of credit, one may consult the foreword of [16] and the bibliographic notes in [3].

1. Smoothing Splines

Given pairs $(x_1,Y_1)$, …, $(x_n,Y_n)$, one may obtain a smoothing spline via the minimization of

$$\frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i-\eta(x_i)\bigr)^2+\lambda\int_0^1\bigl(\eta''(x)\bigr)^2\,dx.\tag{1}$$

The minimizer of (1) is a piecewise cubic polynomial, twice continuously differentiable, with its third derivative jumping at the distinct $x_i$.

In the mathematics or numerical analysis literature, a spline typically refers to a piecewise polynomial; there, one is concerned with function interpolation or approximation, and the pairs are exact samples satisfying $Y_i=\eta(x_i)$.

With stochastic data, one does not have exact samples of the function and needs statistical models. A regression model behind (1) has $Y_i=\eta(x_i)+\epsilon_i$, $\epsilon_i\sim N(0,\sigma^2)$ independent, and the minimizer of (1) provides an estimate of the regression function $\eta(x)$.

The first term in (1) pushes for a close fit of $\eta$ to the data and the second term penalizes the roughness of $\eta$, with the smoothing parameter $\lambda$ controlling the tradeoff between the two conflicting goals. As $\lambda\to\infty$, one approaches a simple linear regression line; $\lambda\to 0$ yields the minimum curvature interpolant.

Figure 1. Cubic smoothing splines.

Figure 1 illustrates the cubic smoothing spline of (1) applied to some data from a simulated motorcycle accident, found in R package MASS as data frame mcycle, where $x$ is time and $Y$ is head acceleration. The data are in circles and the three lines correspond to three values of $\lambda$, from rough to smooth (red, black, and green, respectively), with the middle one (black) selected by cross validation via [15] using ssanova0 in R package gss; a rescaling of $\lambda$ accounts for the mapping of the time interval onto $[0,1]$ in the software implementation.
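For readers who would like to reproduce a fit of this kind, the following R sketch applies ssanova0 to the mcycle data; the call mirrors the gss formula interface, but the plotting details and the particular $\lambda$ values shown in Figure 1 are not reproduced here, so treat it as a minimal illustration rather than the exact script behind the figure.

## A minimal sketch of a cross-validated cubic smoothing spline fit to mcycle;
## illustrative only, not the exact script behind Figure 1.
library(MASS)   # mcycle: times (x) and accel (Y)
library(gss)    # ssanova0: smoothing spline ANOVA models

fit <- ssanova0(accel ~ times, data = mcycle)   # lambda selected by cross validation

## Evaluate the fitted curve on a grid and overlay it on the data
grid <- data.frame(times = seq(min(mcycle$times), max(mcycle$times), length.out = 200))
est  <- predict(fit, newdata = grid, se.fit = TRUE)

plot(mcycle$times, mcycle$accel, xlab = "time", ylab = "head acceleration")
lines(grid$times, est$fit)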

Regression analysis, a primary tool of supervised learning, is widely used in applications. Traditional parametric regression was developed when data were scarce, restricting $\eta$ to low-dimensional function spaces with a few free parameters; this maintains sufficient samples per parameter to effectively control “variance” in estimation, but may incur serious model bias. As the sample size increases, “variance” is less of a concern and one looks to reduce model bias, yielding numerous non-parametric regression techniques that permit “flexible” forms of $\eta$. The smoothing spline of (1) presents a tidy approach to non-parametric regression, tuning $\lambda$ to control the effective dimension of the model space.

Many non-parametric regression methods exist, and all perform equally well in one dimension. The real challenge is in the modeling/estimation of functions on multi-dimensional domains, and generalizations of (1) lead to unparalleled operational convenience and a rich collection of modeling tools.

1.1. Penalized likelihood method

The penalized likelihood method results from an abstraction of (1). To estimate a function of interest $\eta$ on a generic domain $\mathcal{X}$ using stochastic data, one may minimize

$$L(\eta)+\lambda J(\eta),\tag{2}$$

where $L(\eta)$ can be taken as the minus log likelihood of the data and $J(\eta)$ is a quadratic functional quantifying the roughness of $\eta$.

In a step towards an abstract $J(\eta)$, Kimeldorf and Wahba [4, 5, 6] considered $J(\eta)=\int_0^1\bigl((L\eta)(x)\bigr)^2dx$ on $[0,1]$ for some general differential operator $L$, with $L=D^m$ being special cases yielding polynomial smoothing splines. More examples of $L(\eta)$ and $J(\eta)$ will follow shortly.

A smoothing spline is defined as the solution to a variational problem of the form given in (2). Depending on the configuration, it may or may not reduce to a piecewise polynomial.

The first term in (1) is proportional to the minus log likelihood of the Gaussian regression model stated above. Two more examples of $L(\eta)$ follow.

Example 1 (Logistic regression).

Consider $Y|x\sim\mathrm{Bin}\bigl(1,p(x)\bigr)$, where $p(x)=P(Y=1\mid x)$. One may use $L(\eta)=\frac{1}{n}\sum_{i=1}^{n}\bigl\{\log\bigl(1+e^{\eta(x_i)}\bigr)-Y_i\,\eta(x_i)\bigr\}$ for the estimation of the logit function $\eta(x)=\log\bigl\{p(x)/\bigl(1-p(x)\bigr)\bigr\}$.

Example 2 (Density estimation).

Let $X_1,\dots,X_n$ be independent samples from a probability density $f(x)$ on a domain $\mathcal X$. To estimate $f$, one may use

$$-\frac{1}{n}\sum_{i=1}^{n}\eta(X_i)+\log\int_{\mathcal X}e^{\eta(x)}\,dx+\lambda J(\eta),\tag{3}$$

with $f=e^{\eta}\big/\!\int_{\mathcal X}e^{\eta(x)}\,dx$.

Example 1 is a special case of non-Gaussian regression. A variant of Example 2 was studied by Silverman [11].

Figure 2. Density estimation.

Shown in Figure 2 is an example of one-dimensional density estimation using (3); the data are 272 waiting times between eruptions of the Old Faithful geyser, found in the R data frame faithful. Plotted is a cross-validated density estimate (see §3) computed using ssden in R package gss, along with a histogram of the data. This is an instance of unsupervised learning.
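The following R sketch produces an estimate of this kind; the calls follow the gss interface (ssden for the fit, dssden for evaluating the estimated density), while the grid and graphical details are illustrative only.

## A minimal sketch of penalized likelihood density estimation for the
## Old Faithful waiting times; illustrative only.
library(gss)

fit  <- ssden(~ waiting, data = faithful)   # smoothing parameter chosen by cross validation
grid <- seq(min(faithful$waiting), max(faithful$waiting), length.out = 200)
dens <- dssden(fit, grid)                   # evaluate the estimated density on the grid

hist(faithful$waiting, breaks = 20, freq = FALSE, xlab = "waiting time", main = "")
lines(grid, dens)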

The penalized likelihood of (2) is in fact performing constrained maximum likelihood estimation,

$$\min_{\eta}\ L(\eta)\quad\text{subject to}\quad J(\eta)\le\rho,$$

using the Lagrange method, with $\lambda$ being the Lagrange multiplier; see [3], Theorem 2.12, where $\rho$ is also quantified.

The null space of $J(\eta)$, $\mathcal N_J=\{\eta:J(\eta)=0\}$, specifies a parametric model for $\eta$. With a $\rho>0$, data-adaptive through the selection of $\lambda$, one allows for $\eta$ to depart from the parametric model.

1.2. Reproducing kernel Hilbert spaces

The minimization of (2) is implicitly in the space $\{\eta:J(\eta)<\infty\}$ or a subspace therein, and function evaluations $\eta(x_i)$ typically appear in $L(\eta)$. To facilitate analysis and computation, one needs a metric and a geometry in the function space, and needs the evaluation functional to be continuous.

A reproducing kernel Hilbert space $\mathcal H$ is a Hilbert space of functions on a domain $\mathcal X$ in which the evaluation functional $f\mapsto f(x)$ is continuous, $\forall x\in\mathcal X$. A comprehensive theory of reproducing kernel Hilbert spaces can be found in [1].

By the Riesz representation theorem, there exists a reproducing kernel $R(x,y)$, a non-negative definite bivariate function dual to the inner product $\langle\cdot,\cdot\rangle$ in $\mathcal H$, which satisfies $R(x,\cdot)\in\mathcal H$ and $\bigl\langle R(x,\cdot),f(\cdot)\bigr\rangle=f(x)$, $\forall f\in\mathcal H$, $x\in\mathcal X$.

A reproducing kernel Hilbert space can also be generated from its reproducing kernel $R(x,y)$, for which any non-negative definite function qualifies, as the “column space” $\overline{\mathrm{span}}\{R(x,\cdot):x\in\mathcal X\}$. The corresponding inner product may or may not have an explicit expression, however.
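As a toy illustration of the “column space” view (our own example, not from the article), take the non-negative definite kernel $R(x,y)=\min(x,y)$ on $[0,1]$; finite linear combinations of its sections $R(x_i,\cdot)$ already suffice to interpolate data at the $x_i$.

## Functions of the form f(s) = sum_i c_i R(x_i, s) built from kernel sections;
## a toy illustration with R(x,y) = min(x,y).
set.seed(2)
x <- sort(runif(8)); y <- sin(2 * pi * x)

Q     <- outer(x, x, pmin)          # Gram matrix R(x_i, x_j), positive definite here
c_hat <- solve(Q, y)                # coefficients so that f interpolates the data

f <- function(s) colSums(c_hat * outer(x, s, pmin))   # f(s) = sum_i c_i R(x_i, s)
max(abs(f(x) - y))                  # ~ 0: f reproduces the data at the x_i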

For use in (2), one takes $\mathcal H=\{\eta:J(\eta)<\infty\}$ equipped with $\langle\eta,f\rangle=J(\eta,f)+(\eta,f)$, where $J(\eta,f)$ is the semi-inner product associated with the quadratic functional $J(\eta)=J(\eta,\eta)$ and $(\eta,f)$ is an inner product in the null space $\mathcal N_J=\{\eta:J(\eta)=0\}$. One has a tensor-sum decomposition $\mathcal H=\mathcal N_J\oplus\mathcal H_J$ with $J(\eta,f)$ being a full inner product in $\mathcal H_J$.

Example 3 (Cubic spline).

Consider, on $[0,1]$, $J(\eta)=\int_0^1\bigl(\eta''(x)\bigr)^2dx$. For an inner product in $\mathcal N_J=\mathrm{span}\{1,x\}$, take $(\eta,f)=\bigl(\int_0^1\eta\,dx\bigr)\bigl(\int_0^1 f\,dx\bigr)+\bigl(\int_0^1\eta'\,dx\bigr)\bigl(\int_0^1 f'\,dx\bigr)$. It follows that $\mathcal H=\mathcal N_J\oplus\mathcal H_J$, with $\mathcal H_J=\bigl\{\eta:\int_0^1\eta\,dx=\int_0^1\eta'\,dx=0,\ J(\eta)<\infty\bigr\}$.

The reproducing kernel in $\mathcal H$ is known to be $R(x,y)=1+k_1(x)k_1(y)+k_2(x)k_2(y)-k_4(|x-y|)$, where $k_\nu(x)=B_\nu(x)/\nu!$ are scaled Bernoulli polynomials. One may further decompose $\mathcal N_J=\{1\}\oplus\{k_1\}$ for $k_1(x)=x-\tfrac12$, with the respective reproducing kernels given by $1$ and $k_1(x)k_1(y)$.

Facts concerning tensor-sum decompositions of reproducing kernel Hilbert spaces can be found in [3], Theorem 2.5, and technical details of Example 3 are in Craven and Wahba [2].
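For concreteness, the following R sketch evaluates the cubic spline reproducing kernel of Example 3 in its standard Craven–Wahba form (our transcription, with $k_\nu=B_\nu/\nu!$ the scaled Bernoulli polynomials) and checks non-negative definiteness numerically.

## The cubic spline reproducing kernel on [0,1] via scaled Bernoulli polynomials;
## a sketch assuming the standard Craven-Wahba form.
k1 <- function(x) x - 0.5
k2 <- function(x) (k1(x)^2 - 1/12) / 2
k4 <- function(x) (k1(x)^4 - k1(x)^2 / 2 + 7/240) / 24

RK <- function(x, y)                # kernel of {1} + {k1} + the smooth component
  1 + k1(x) * k1(y) + k2(x) * k2(y) - k4(abs(x - y))

x <- seq(0.05, 0.95, by = 0.05)
Q <- outer(x, x, RK)
min(eigen(Q, symmetric = TRUE, only.values = TRUE)$values)   # >= 0 up to rounding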

The theory of reproducing kernel Hilbert spaces provides an abstract mathematical framework encompassing a great variety of problems. The abstract setting allows many important issues, such as the computation and the asymptotic convergence of the minimizers of (2), to be treated in a unified fashion. As summarized in Grace’s 1990 monograph [16], much of her work up to that date, from approximation theory to spline smoothing, fit under the general framework.

1.3. Tensor product splines

A statistical model should be interpretable, which distinguishes it from mere function approximation or some black-box predictor/classifier. Two main challenges in the non-parametric modeling of multivariate data are weak interpretability and the curse of dimensionality, which might be alleviated via a hierarchical structure of the functional ANOVA decomposition.

Functional ANOVA decomposition

Consider a bivariate function $\eta(x_{\langle1\rangle},x_{\langle2\rangle})$ on a product domain $\mathcal X_1\times\mathcal X_2$; subscripts in brackets denote coordinates of a point on a multidimensional domain while ordinary subscripts are reserved for multiple points. One may write

$$\eta=\bigl(A_1+(I-A_1)\bigr)\bigl(A_2+(I-A_2)\bigr)\eta
=A_1A_2\eta+(I-A_1)A_2\eta+A_1(I-A_2)\eta+(I-A_1)(I-A_2)\eta
=\eta_{\emptyset}+\eta_1+\eta_2+\eta_{12},\tag{4}$$

where $I$ is the identity operator, $A_1$, $A_2$ are averaging operators acting on arguments $x_{\langle1\rangle}$, $x_{\langle2\rangle}$, respectively, that satisfy $A1=1$, $\eta_{\emptyset}$ is the constant term, $\eta_1$, $\eta_2$ are main effects, and $\eta_{12}$ is the interaction, satisfying $A_1\eta_1=A_2\eta_2=0$, $A_1\eta_{12}=A_2\eta_{12}=0$. Similar constructions in more than two dimensions are straightforward.

Examples of averaging operators include $A\eta=\int_0^1\eta(x)\,dx$ on $[0,1]$ and $A\eta=\eta(x_0)$ at a chosen point $x_0$ on any domain; averaging operators on different axes are independent of each other.

For $\mathcal X_1$, $\mathcal X_2$ discrete, $\eta$ is a matrix of treatment means usually denoted by $\mu_{ij}$ in a standard two-way ANOVA model, with (4) in the form

$$\mu_{ij}=\mu_{\emptyset}+\alpha_i+\beta_j+(\alpha\beta)_{ij},$$

where $\alpha_i$ and $\beta_j$ are the main effects and $(\alpha\beta)_{ij}$ is the interaction, satisfying side conditions such as $\sum_i\alpha_i=\sum_j\beta_j=0$ and $\sum_i(\alpha\beta)_{ij}=\sum_j(\alpha\beta)_{ij}=0$ when the averaging operators take simple averages.
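The decomposition in the discrete case is elementary; the following R toy example (ours, with simple averages as the averaging operators) recovers the terms of (4) from a matrix of treatment means.

## Two-way ANOVA decomposition of a matrix of treatment means; toy example.
mu <- matrix(c(10, 12, 14,
               11, 15, 13), nrow = 2, byrow = TRUE)   # mu_{ij}

grand <- mean(mu)                                     # constant term
alpha <- rowMeans(mu) - grand                         # main effect of factor 1
beta  <- colMeans(mu) - grand                         # main effect of factor 2
inter <- sweep(sweep(mu - grand, 1, alpha), 2, beta)  # interaction

## The pieces add back to mu and satisfy the zero-sum side conditions
all.equal(mu, grand + outer(alpha, rep(1, 3)) + outer(rep(1, 2), beta) + inter)
c(sum(alpha), sum(beta), rowSums(inter), colSums(inter))   # all (near) zero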

Selective term elimination in functional ANOVA decompositions helps to combat the curse of dimensionality in estimation and facilitates the interpretation of the analysis. For example, the so-called additive models, those containing only main effects, are easier to estimate and interpret than ones involving interactions. As with classical ANOVA models on discrete domains, the inclusion of higher-order interactions is to be avoided in practice.

For a log density, the absence of selected interactions may imply (conditional) independence structures among random variables. Taking random variables $(X_{\langle1\rangle},X_{\langle2\rangle},X_{\langle3\rangle})$ on domain $\mathcal X_1\times\mathcal X_2\times\mathcal X_3$, say, a log density of the form $\eta=\eta_{\emptyset}+\eta_1+\eta_2+\eta_3+\eta_{12}+\eta_{13}$ implies the conditional independence of $X_{\langle2\rangle}$ and $X_{\langle3\rangle}$ given $X_{\langle1\rangle}$, or $X_{\langle2\rangle}\perp X_{\langle3\rangle}\mid X_{\langle1\rangle}$, where the notation for ANOVA terms of $\eta$ parallels that in (4).
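The reason is a one-line factorization, sketched here for completeness:

$$f(x_{\langle1\rangle},x_{\langle2\rangle},x_{\langle3\rangle})\propto e^{\eta_1+\eta_2+\eta_3+\eta_{12}+\eta_{13}}
=\bigl(e^{\eta_1+\eta_2+\eta_{12}}\bigr)\bigl(e^{\eta_3+\eta_{13}}\bigr),$$

a product of a function of $(x_{\langle1\rangle},x_{\langle2\rangle})$ and a function of $(x_{\langle1\rangle},x_{\langle3\rangle})$, so conditionally on $X_{\langle1\rangle}$ the joint density of $(X_{\langle2\rangle},X_{\langle3\rangle})$ factors.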

Tensor product spaces

For the estimation of $\eta$ on product domains via (2), functional ANOVA decompositions can be built in through the construction of tensor product reproducing kernel Hilbert spaces.

Theorem 1.

Consider $x_{\langle1\rangle},y_{\langle1\rangle}\in\mathcal X_1$ and $x_{\langle2\rangle},y_{\langle2\rangle}\in\mathcal X_2$. For $R_1(\cdot,\cdot)$ non-negative definite on $\mathcal X_1$ and $R_2(\cdot,\cdot)$ non-negative definite on $\mathcal X_2$, $R\bigl((x_{\langle1\rangle},x_{\langle2\rangle}),(y_{\langle1\rangle},y_{\langle2\rangle})\bigr)=R_1(x_{\langle1\rangle},y_{\langle1\rangle})\,R_2(x_{\langle2\rangle},y_{\langle2\rangle})$ is non-negative definite on $\mathcal X_1\times\mathcal X_2$.
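A quick numerical check of Theorem 1 (our own illustration): on a common set of points, the Gram matrix of the product kernel is the elementwise (Schur) product of the marginal Gram matrices, and its eigenvalues are non-negative.

## Numerical check that the product of two non-negative definite kernels,
## evaluated on a product domain, is again non-negative definite.
set.seed(1)
x1 <- runif(20); x2 <- runif(20)      # 20 points (x1_i, x2_i) on [0,1]^2

R1 <- outer(x1, x1, pmin)             # kernel min(x,y) on the first axis
R2 <- 1 + outer(x2, x2)               # kernel 1 + x*y on the second axis

R  <- R1 * R2                         # product kernel on the product domain
min(eigen(R, symmetric = TRUE, only.values = TRUE)$values)   # >= 0 up to rounding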

To construct a reproducing kernel Hilbert space, it suffices to specify a reproducing kernel. The following example illustrates a construction on $[0,1]^2$ using the results of Example 3.

Example 4 (Tensor-product cubic spline).

The space on each marginal axis is decomposed as in Example 3. We rewrite the decomposition as $\{1\}\oplus\{k_1\}\oplus\mathcal H_s$, with $\mathcal H_s$ the smooth component, and denote the respective reproducing kernels as $1$, $k_1(x)k_1(y)$, and $R_s(x,y)=k_2(x)k_2(y)-k_4(|x-y|)$. A one-way ANOVA decomposition is built in, with the averaging operator $A\eta=\int_0^1\eta\,dx$ satisfying $A\eta=0$ for $\eta\in\{k_1\}\oplus\mathcal H_s$.

Using these reproducing kernels on the two axes and taking products as in Theorem 1, one has nine reproducing kernels on $[0,1]^2$, each yielding a tensor product space, as laid out below:

$$\begin{array}{ccc}
1 & k_1(x_{\langle2\rangle})k_1(y_{\langle2\rangle}) & R_s(x_{\langle2\rangle},y_{\langle2\rangle})\\
k_1(x_{\langle1\rangle})k_1(y_{\langle1\rangle}) & k_1(x_{\langle1\rangle})k_1(y_{\langle1\rangle})\,k_1(x_{\langle2\rangle})k_1(y_{\langle2\rangle}) & k_1(x_{\langle1\rangle})k_1(y_{\langle1\rangle})\,R_s(x_{\langle2\rangle},y_{\langle2\rangle})\\
R_s(x_{\langle1\rangle},y_{\langle1\rangle}) & R_s(x_{\langle1\rangle},y_{\langle1\rangle})\,k_1(x_{\langle2\rangle})k_1(y_{\langle2\rangle}) & R_s(x_{\langle1\rangle},y_{\langle1\rangle})\,R_s(x_{\langle2\rangle},y_{\langle2\rangle})
\end{array}$$

The ANOVA decomposition of (4) is built in, with $\eta_1$ residing in the spaces of the first column (below the constant), $\eta_2$ in those of the first row (to its right), etc.

Pasting these spaces together via tensor-sum, one has what is needed for use in (2). The four spaces on the upper-left corner are of one dimension each, and can be lumped into $\mathcal N_J$. The remaining five spaces can be put together as $\mathcal H_J$ with a reproducing kernel

$$R_J=\theta_1 R_s(x_{\langle1\rangle},y_{\langle1\rangle})+\theta_2 R_s(x_{\langle2\rangle},y_{\langle2\rangle})
+\theta_3 R_s(x_{\langle1\rangle},y_{\langle1\rangle})\,k_1(x_{\langle2\rangle})k_1(y_{\langle2\rangle})
+\theta_4 k_1(x_{\langle1\rangle})k_1(y_{\langle1\rangle})\,R_s(x_{\langle2\rangle},y_{\langle2\rangle})
+\theta_5 R_s(x_{\langle1\rangle},y_{\langle1\rangle})\,R_s(x_{\langle2\rangle},y_{\langle2\rangle}),$$

where the $\theta$'s are extra smoothing parameters adjusting the relative contribution of each component to the overall roughness measure.

To enforce an additive model, one removes $k_1(x_{\langle1\rangle})k_1(x_{\langle2\rangle})$ from $\mathcal N_J$ and sets $\theta_3=\theta_4=\theta_5=0$.

The construction described in Example 4 is at an abstract level, requiring few specifics of cubic splines on $[0,1]$. All one needs are reproducing kernels on the marginal domains. Functional ANOVA decomposition follows trivially from one-way ANOVA decompositions on the marginal domains.

Denote by $\mathcal H_\beta$ the component spaces of $\mathcal H_J$ in Example 4 and write $R_\beta$, $\|\cdot\|_\beta^2$ as the respective reproducing kernels and the associated square norms. With $R_J=\sum_\beta\theta_\beta R_\beta$, the corresponding square norm in $\mathcal H_J$ is given by $J(\eta)=\sum_\beta\theta_\beta^{-1}\|\eta_\beta\|_\beta^2$ for $\eta=\sum_\beta\eta_\beta$, to be used in (2).
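In software, the bookkeeping of Example 4 is automated; a small simulated illustration (ours, not from the article) using ssanova in package gss follows, where x1*x2 in the model formula requests the two main effects and their interaction.

## A tensor product cubic spline fit on [0,1]^2; illustrative sketch.
library(gss)
set.seed(3)
n  <- 200
x1 <- runif(n); x2 <- runif(n)
y  <- sin(2 * pi * x1) + 2 * (x2 - 0.5)^2 + x1 * x2 + rnorm(n, sd = 0.2)
dat <- data.frame(y = y, x1 = x1, x2 = x2)

fit <- ssanova(y ~ x1 * x2, data = dat)   # eta_0 + eta_1 + eta_2 + eta_12
summary(fit)
est <- predict(fit, newdata = data.frame(x1 = 0.3, x2 = 0.7), se.fit = TRUE)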

As an illustration, we fit a logistic regression model of Example 1 to the wesdr data found in R package gss, concerning the progression of diabetic retinopathy. One has $x=(x_{\langle1\rangle},x_{\langle2\rangle},x_{\langle3\rangle})$ and a binary $Y$, and the full model would include three main effects, three two-way interactions, and a three-way interaction. After some exploration, the interaction terms are found to be negligible so an additive model appears adequate, and one of the main effects is actually linear. We finally fit a model

$$\eta(x)=\eta_{\emptyset}+\beta\,x_{\langle1\rangle}+\eta_2(x_{\langle2\rangle})+\eta_3(x_{\langle3\rangle}),$$

where $\beta$ is a constant coefficient; this is actually a partial spline model discussed at the end of this section. Plotted in Figure 3 are the estimated $\eta_2$ and $\eta_3$ along with their Bayesian confidence intervals (see §4). The rugs on the ceiling and the floor mark the observed $x_{\langle2\rangle}$ and $x_{\langle3\rangle}$, respectively, though $Y$ does not depend on $x_{\langle2\rangle}$ or $x_{\langle3\rangle}$ alone; the confidence intervals do get wider where data are sparse. The analysis was done using the gssanova facilities in package gss. Further details can be found in [3], §5.5.3.
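A sketch of such a fit with the gssanova facilities follows; the column names of wesdr (dur, gly, bmi, ret) are taken from the package documentation, and the call below fits a plain additive model without singling out the linear term, so it is an approximation to, not a transcription of, the analysis behind Figure 3.

## An additive logistic fit to the wesdr data; illustrative sketch.
library(gss)
data(wesdr)

fit <- gssanova(ret ~ dur + gly + bmi, family = "binomial", data = wesdr)
summary(fit)

## Evaluate the fit, with standard errors for Bayesian confidence intervals,
## along one covariate with the others held at their medians
grid <- data.frame(dur = seq(min(wesdr$dur), max(wesdr$dur), length.out = 101),
                   gly = median(wesdr$gly), bmi = median(wesdr$bmi))
est <- predict(fit, newdata = grid, se.fit = TRUE)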

Figure 3. Additive logistic regression.

Grace’s signature was all over the tensor product spline technique, from the inception of the idea to the ensuing rigorous developments, involving several of her students including your authors; see, e.g., [16], Chap. 10, and [18].

1.4. More splines

Real intervals are the most commonly encountered domains in practice, which can be mapped onto $[0,1]$, and the cubic spline on $[0,1]$ with $J(\eta)=\int_0^1\bigl(\eta''(x)\bigr)^2dx$ is the commonly used configuration for data smoothing in the setting. On domains other than real intervals, or to accommodate various special needs, alternatives to cubic splines are sometimes called for.

We now present a variety of configurations tuned to various situations, which may be used directly in (2) on the respective designated domains, or be used as building blocks to construct tensor product splines.

Periodic splines

To accommodate recurring patterns such as circadian rhythms or seasonal effects, one may consider only periodic functions on a real interval. Mapping the interval onto $[0,1]$ and setting $J(\eta)=\int_0^1\bigl(\eta''(x)\bigr)^2dx$ for $\eta$ periodic, one has $\mathcal H=\{1\}\oplus\mathcal H_1$, with component space reproducing kernels $1$ and $-k_4(x-y)$; see Craven and Wahba [2].
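In Fourier form, the standard calculation behind these kernels reads as follows (included here for concreteness): for periodic $\eta$ on $[0,1]$,

$$\eta(x)=a_0+\sum_{\nu=1}^{\infty}\bigl(a_\nu\cos 2\pi\nu x+b_\nu\sin 2\pi\nu x\bigr),\qquad
\int_0^1\bigl(\eta''(x)\bigr)^2dx=\frac12\sum_{\nu=1}^{\infty}(2\pi\nu)^4\bigl(a_\nu^2+b_\nu^2\bigr),$$

and the reproducing kernel of the penalized component is

$$R_1(x,y)=\sum_{\nu=1}^{\infty}\frac{2}{(2\pi\nu)^4}\cos 2\pi\nu(x-y)=-k_4(x-y).$$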

L-splines

On $[0,1]$, one may configure an L-spline by setting $J(\eta)=\int_0^1\bigl((L\eta)(x)\bigr)^2h(x)\,dx$, where $L$ is a general differential operator and $h(x)>0$ is a weight function. With $L=D^m$ and $h=1$, it reduces to a polynomial spline; $m=2$ yields the cubic spline.

When the null space forms a more desirable parametric model than lower-order (say linear) polynomials, the corresponding L-spline is preferred over a polynomial spline.

Example 5 (Exponential spline).

For some $\theta>0$, set $L=D^2+\theta D$ and $h=1$. The null space $\mathcal N_J=\mathrm{span}\{1,e^{-\theta x}\}$ makes a reasonable model for a growth curve.

Transforming by , it can be shown that