
Mathematical Foundations of Graph-Based Bayesian Semi-Supervised Learning

N. García Trillos
D. Sanz-Alonso
R. Yang

Communicated by Notices Associate Editor Richard Levine


In recent decades, science and engineering have been revolutionized by a momentous growth in the amount of available data. However, despite the unprecedented ease with which data are now collected and stored, labeling data by supplementing each feature with an informative tag remains challenging. Illustrative tasks where the labeling process requires expert knowledge or is tedious and time-consuming include labeling X-rays with a diagnosis, protein sequences with a protein type, texts by their topic, tweets by their sentiment, or videos by their genre. In these and numerous other examples, only a few features may be manually labeled due to cost and time constraints. How can we best propagate label information from a small number of expensive labeled features to a vast number of unlabeled ones? This is the question addressed by semi-supervised learning (SSL).

This article overviews recent foundational developments on graph-based Bayesian SSL, a probabilistic framework for label propagation using similarities between features. SSL is an active research area and a thorough review of the extant literature is beyond the scope of this article.⁠Footnote1 Our focus will be on topics drawn from our own research that illustrate the wide range of mathematical tools and ideas that underlie the rigorous study of the statistical accuracy and computational efficiency of graph-based Bayesian SSL.

1

AMS Notices limits the number of references per article; for this reason, we refer to vEH20, GTKSSA20, SAY20b for further pointers to the literature.

1. Semi-Supervised Learning (SSL)

Let us start by formalizing the problem setting. Suppose we are given

$$\text{labeled data } \{(x_i, y_i)\}_{i=1}^{n} \quad \text{and} \quad \text{unlabeled data } \{x_i\}_{i=n+1}^{N}, \qquad (1)$$

where $(x_1, y_1), \ldots, (x_n, y_n)$ are independent draws from a random variable $(x, y)$ with joint law $\mathcal{P}$, and $x_{n+1}, \ldots, x_N$ are independent draws from the marginal law $\gamma$ of the features. We refer to the $x_i$'s as features and to the $y_i$'s as labels.Footnote2 Due to the cost associated with labeling features, in SSL applications the number $n$ of labels is usually small relative to the number $N$ of features. The goal is then to propagate the few given labels to the collection of all given features. Precisely, we consider the problem of using all labeled and unlabeled data to estimate the conditional mean function

2

In machine learning, features and labels are often referred to as inputs and outputs, respectively.

$f_0(x) := \mathbb{E}[y \mid x]$ at the given features $x_1, \ldots, x_N$. We call $f_0$ the labeling function. For ease of exposition, we will restrict our attention to regression and classification problems, where the labeled pairs are generated from:

$$\text{(regression)} \quad y_i = f_0(x_i) + \eta_i, \quad \eta_i \overset{\text{i.i.d.}}{\sim} N(0, \delta^2); \qquad \text{(classification)} \quad y_i \mid x_i \sim \text{Bernoulli}\big(f_0(x_i)\big), \qquad (2)$$

with $\delta$ known in the regression setting. Regression and classification are prototypical examples of SSL tasks with real-valued and discrete-valued labels, respectively. To streamline the presentation, we focus on binary classification where there are only two distinct classes, labeled by $0$ and $1$. Several probabilistic models for binary classification are reviewed in BLSZ18. Extensions to non-Gaussian noise or multi-class classification can be treated in a similar fashion.
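To make the data-generating models in 2 concrete, the following minimal sketch (not taken from the article) simulates labeled data in both settings; the labeling function f0 and the sizes N, n, and noise level delta are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, delta = 1000, 20, 0.1                  # hypothetical sizes and noise level

x = rng.uniform(0.0, 2.0 * np.pi, size=N)    # features: here, angles parametrizing a circle
f0 = lambda t: 0.5 * (1.0 + np.sin(t))       # hypothetical labeling function with values in [0, 1]

# Regression: y_i = f0(x_i) + N(0, delta^2) noise, observed only for the first n features.
y_regression = f0(x[:n]) + delta * rng.standard_normal(n)

# Binary classification: y_i | x_i ~ Bernoulli(f0(x_i)), observed only for the first n features.
y_classification = rng.binomial(1, f0(x[:n]))
```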

As its name suggests, SSL lies between supervised and unsupervised learning. In supervised learning, the goal is to use labeled data to learn the labeling function $f_0$, so that it may later be evaluated at new features. On the other hand, unsupervised learning is concerned with using unlabeled data to extract important geometric information from the feature space, such as its cluster structure. SSL leverages both labeled and unlabeled data in the learning procedure; all given features are used to recover the geometry of the feature space, and this geometric information is exploited to learn $f_0$.

Task              Labeled   Unlabeled
Supervised           ✓          ✗
Unsupervised         ✗          ✓
Semi-supervised      ✓          ✓

Relying on unlabeled data to estimate the labeling function may seem counterintuitive at first. Indeed, the question of whether unlabeled data can enhance the learning performance in SSL has been widely debated, and different conclusions can be reached depending on the assumed relationship between the label generating mechanism and the marginal distribution of the features. We will tacitly adopt the smoothness assumption—often satisfied in applications—that similar features should receive similar labels. In other words, we assume that the labeling function varies smoothly along the feature space. Under such a model assumption, one can intuitively expect that unlabeled data may boost the learning performance: uncovering the geometry of the feature space via the unlabeled data facilitates defining a smoothness-promoting regularization procedure for the recovery of the labeling function. The main idea of the graph-based Bayesian approach to SSL is to use a graph-theoretical construction to turn pairwise similarities between features into a probabilistic regularization procedure.

2. Graph-Based Bayesian SSL

We will take a Bayesian perspective to learn the restriction of $f_0$ to the features, denoted

$$u_0 := \big( f_0(x_1), \ldots, f_0(x_N) \big).$$

We view $u_0$ as a vector in $\mathbb{R}^N$, with coordinates $u_0(i) = f_0(x_i)$. In the Bayesian approach, inference is performed using a posterior distribution over candidate vectors $u \in \mathbb{R}^N$, denoted $\mu_N$. The posterior density $\mu_N(u)$ will be large for functions $u$ that are consistent with the given labeled data and with our belief that similar features should receive similar labels. The posterior density is defined by combining a likelihood function $L(u; y)$ and a prior distribution $\pi_N(u)$ that encode, respectively, these two requirements:

$$\mu_N(u) \propto L(u; y) \, \pi_N(u). \qquad (3)$$

Here and elsewhere $y := (y_1, \ldots, y_n)$ is used as a shortcut for all given labels. We next describe, in turn, the definition of likelihood and prior, followed by a discussion of how the posterior distribution is used to conduct inference in the Bayesian framework.

Likelihood function

The likelihood function encodes the degree of probabilistic agreement of a labeling function $u$ with the observed labels $y$, based on the model defined by 1 and 2. The independence structure in 1 implies that the likelihood factorizes as

$$L(u; y) = \prod_{i=1}^{n} L\big( u(i); y_i \big),$$

and the Gaussian and binomial distributional assumptions in 2 give that

$$L\big( u(i); y_i \big) \propto \begin{cases} \exp\!\Big( -\dfrac{1}{2\delta^2} \big| y_i - u(i) \big|^2 \Big), & \text{(regression)}, \\[4pt] u(i)^{\,y_i} \big( 1 - u(i) \big)^{1 - y_i}, & \text{(classification)}. \end{cases}$$

Note that the features are not involved in the definition of the likelihood function; in particular, the likelihood function does not depend on the unlabeled data.
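The log-likelihoods above are straightforward to evaluate in code. The sketch below is an illustration, not the article's implementation; for simplicity it assumes that the first n coordinates of the candidate vector u correspond to the labeled features.

```python
import numpy as np

def log_likelihood(u, y, delta=None, model="regression"):
    """Log-likelihood of a candidate labeling vector u given the labels y.

    Only the first n = len(y) coordinates of u enter, reflecting that the
    likelihood depends on the labeled features alone.
    """
    u_lab = u[: len(y)]
    if model == "regression":                  # Gaussian noise model, up to an additive constant
        return -0.5 * np.sum((y - u_lab) ** 2) / delta**2
    elif model == "classification":            # Bernoulli labels with success probability u(i)
        u_lab = np.clip(u_lab, 1e-12, 1 - 1e-12)
        return np.sum(y * np.log(u_lab) + (1 - y) * np.log(1 - u_lab))
    raise ValueError("model must be 'regression' or 'classification'")
```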

Prior distribution

The prior encodes the belief that the labeling function should take similar values at similar features. We thus seek to design the prior so that its density is large for functions that vary smoothly along the feature data. We will achieve this goal by defining the prior as a transformation of a Gaussian random vector as follows:

$$u = S(v_N), \qquad v_N \sim N(0, C_N).$$

Here $S$ is a link function that ensures that in the classification setting the prior samples take coordinate-wise values in $[0, 1]$. Section 3 will discuss how to use graph-based techniques to define the covariance structure $C_N$ of the latent vector $v_N$ so that $v_N(x_i)$ and $v_N(x_j)$ are highly correlated if the features $x_i$ and $x_j$ are similar. We term the approach “graph-based” because we will view each feature as a node of a graph, and use a matrix $W$ of pairwise similarities between features to define weighted edges. Then, the covariance of $v_N$ will be defined using a graph-Laplacian to penalize certain discrete derivatives of $v_N$. In applications, the pairwise similarities often take the form $W_{ij} = k\big( F(x_i), F(x_j) \big)$, where $F$ is a feature representation map that embeds the feature space in a suitable Euclidean space, and $k$ is a kernel function such as the squared exponential $k(z, \tilde{z}) = \exp\!\big( -|z - \tilde{z}|^2 / (2\ell^2) \big)$, with $|\cdot|$ the Euclidean norm.

Note that labels are not used in the definition of the prior; instead, the prior is designed using pairwise similarities between all labeled and unlabeled features $x_1, \ldots, x_N$.
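As an illustration, here is a minimal construction of such a similarity matrix, assuming the feature representation map F is the identity and the features are the rows of a Euclidean array; the lengthscale ell is a hypothetical tuning parameter.

```python
import numpy as np

def squared_exponential_similarity(X, ell=1.0):
    """Pairwise similarities W_ij = exp(-|x_i - x_j|^2 / (2 ell^2)) for the rows x_i of X."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * ell**2))
```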

Bayesian inference

Bayes’s formula 3 combines likelihood and prior to obtain the posterior distribution, used to perform Bayesian inference. Before moving forward, notice again that the prior is constructed solely in terms of the features, whereas only the labels enter the likelihood function. This insight will be important in later sections.

The posterior density $\mu_N(u)$ quantifies our degree of belief that $u$ is the true (restricted) labeling function that generated the given data. A natural point estimator for $u_0$ is hence the posterior mode,

$$u_{\mathrm{MAP}} := \operatorname*{arg\,max}_{u \in \mathbb{R}^N} \mu_N(u) = \operatorname*{arg\,min}_{u \in \mathbb{R}^N} \Big( -\log L(u; y) - \log \pi_N(u) \Big).$$

The right-hand side showcases that the posterior mode, also known as the maximum a posteriori estimator, can be found by optimizing an objective function comprising a data misfit term and a regularization term, defined by the log-likelihood function and the log-prior density, respectively. This observation reconciles the Bayesian approach with classical optimization methods that—without a probabilistic interpretation—recover the labeling function by minimizing an objective that comprises data misfit and regularization terms.

Under the Bayesian framework, however, the posterior mean and the posterior median can also be used as meaningful point estimators that can be robust to outliers or model misspecification. Moreover, in addition to enabling point estimation, the posterior distribution also allows one to quantify the uncertainty in the reconstruction of the labeling function by computing Bayesian confidence intervals, correlations, or quantiles. All these quantities can be expressed as expectations with respect to the posterior distribution. As will be detailed later, sampling algorithms such as Markov chain Monte Carlo may be used to approximate these posterior expectations.
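In the regression setting with an identity link, the prior and likelihood are both Gaussian, so the posterior is Gaussian and its mean and covariance are available in closed form. The sketch below is illustrative (it assumes the labeled features occupy the first n coordinates) and takes a generic prior covariance matrix C, such as the graph-based covariance constructed in Section 3; the diagonal of the posterior covariance gives pointwise uncertainty estimates analogous to those shown in Figure 1.

```python
import numpy as np

def gaussian_posterior(C, y, delta):
    """Posterior mean and covariance for u ~ N(0, C) and y_i = u(i) + N(0, delta^2), i = 1, ..., n."""
    y = np.asarray(y, dtype=float)
    N, n = C.shape[0], len(y)
    H = np.zeros((n, N))
    H[np.arange(n), np.arange(n)] = 1.0               # observation operator selecting labeled coordinates
    precision = np.linalg.inv(C) + (H.T @ H) / delta**2
    cov_post = np.linalg.inv(precision)
    mean_post = cov_post @ (H.T @ y) / delta**2
    return mean_post, cov_post                         # np.sqrt(np.diag(cov_post)) quantifies uncertainty
```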

Figure 1.

Posterior inference for a synthetic dataset. Each circle represents a feature. Labeled features are marked with red dots. Upper: posterior mean of probabilities of cluster assignment. Lower: the corresponding posterior standard deviations.


Illustrative example

Figure 1 contains a synthetic binary classification toy example where the features are sampled from two disjoint semi-circles, corresponding to two classes. We are given $N$ features, with only a small number $n$ of them labeled (marked with red dots), and aim to classify the unlabeled ones. Using a graph-based prior defined following the ideas in Section 3, we compute the posterior mean—which estimates the probability with which each data point belongs to the lower semi-circle—and the posterior standard deviation—which represents the uncertainty in the estimation. Thresholding the posterior mean at $1/2$ for classification, the results indicate very high accuracy, with lower uncertainty in the reconstruction near labeled features. This example demonstrates a typical scenario in SSL, where the geometric information carried by the unlabeled data helps to achieve good learning performance with few labels.

Outline

Having introduced the graph-based Bayesian formulation of SSL, we are ready to outline the questions that will be explored in the rest of this article, along with their practical motivation:

(1)

Prior Design: How to define the latent field $v_N$ so that the prior promotes smoothness, facilitates computationally efficient inference, and yields a posterior that achieves optimal estimation for a large class of labeling functions?

(2)

Posterior Continuum Limit: For a fixed number $n$ of labels, do the graph-based posteriors converge to a well-defined probability measure in the limit of a large number $N$ of features?

(3)

Posterior Sampling: Can we design sampling algorithms whose rate of convergence does not deteriorate in the large $N$ limit?

(4)

Posterior Contraction: In the joint limit where both $n$ and $N$ are allowed to grow, does the posterior distribution concentrate around the true labeling function? How should $n$ scale with $N$ to achieve optimal estimation?

These questions, along with their interrelations, will be considered in the next four sections. We will focus on graph-based Bayesian SSL, but related asymptotic analyses of SSL include NSZ09, BHL21. An overarching theme in our Bayesian setting will be to guarantee that the prior and the posterior distributions are well defined in the limit of a large number of features (interpreted as graph nodes). This idea is formalized through the study of continuum limits that play an essential role in understanding the statistical performance of graph-based Bayesian SSL and the scalability of sampling algorithms.

In order to set the theory on a rigorous footing, we will adopt the manifold assumption that the features lie on a hidden low-dimensional manifold $\mathcal{M}$ BN04 embedded in a Euclidean space; for the study of posterior contraction, $f_0$ will be assumed to be a smooth function defined on this manifold. We emphasize that the manifold setting is used only for theoretical insight; the methodology is applicable beyond this setting. The manifold assumption is widely adopted in machine learning and high-dimensional statistics, and encapsulates the empirical observation that high-dimensional features often exhibit low-dimensional structure.

We end this section by showing, in a concrete application, the interpretation of features and labels, as well as the intuition behind the manifold and smoothness assumptions. The MNIST dataset consists of images of hand-written digits from 0 to 9. We may want to classify $N$ such images given only $n \ll N$ labels. Each image is a $784$-dimensional vector of $28 \times 28$ pixel intensities, but the space of digits has been estimated HA05 to have a much smaller intrinsic dimension, and can be conceptualized as a low-dimensional manifold embedded in $\mathbb{R}^{784}$. The smoothness assumption reflects the idea that images that are similar are likely to correspond to the same digit, and should therefore receive the same label. As in the synthetic example of Figure 1, we need to construct a suitable prior for functions over the features and study posterior sampling algorithms.

3. Prior Design

In this section we discuss the definition of the Gaussian random vector $v_N$ used to specify the prior $\pi_N$. It will be convenient to think of $v_N$ as a random function over the features $\{x_1, \ldots, x_N\}$, or a random process discretely indexed by the features. We will denote by $v_N(x_i)$ the value of the $i$th coordinate of $v_N$. Such notation will help highlight the analogies between the design of our discretely indexed random vector and the design of Gaussian processes (GPs) in Euclidean domains.

In GP methodology, it is important to impose adequate smoothness assumptions. For instance, in the popular Matérn class of GPs (to be defined shortly in Section 3.2) in Euclidean space, the mean square differentiability of sample paths is tuned by a smoothness parameter. However, here we seek to define a discretely indexed random vector over abstract features—not necessarily embedded in Euclidean space—and the usual notions of smoothness are not readily applicable. To circumvent this issue, we will rely on a matrix $W$ of pairwise similarities between features. At a high level, we would like $v_N$ to be a random function that varies smoothly over $\{x_1, \ldots, x_N\}$ with respect to the pairwise similarities, i.e., the function values $v_N(x_i)$ and $v_N(x_j)$ for $i \neq j$ should be close if the similarity $W_{ij}$ between $x_i$ and $x_j$ is high. In other words, if we view the features as a graph whose edge information is encoded in the similarity matrix $W$, then we wish $v_N$ to be regular with respect to the graph structure of $W$. Techniques from spectral graph theory will allow us to construct a random vector that fulfills this smoothness requirement.

3.1. GPs over graphs

Graph-Laplacians, reviewed here succinctly, will be central to our construction. Given the similarity matrix $W$, let $D$ be the diagonal matrix with entries $D_{ii} = \sum_{j=1}^{N} W_{ij}$. The unnormalized graph-Laplacian is then the matrix $\Delta_N := D - W$. Several normalized graph-Laplacians can also be considered (see, e.g., vL07), but for our purpose we will focus on the unnormalized one. From the relation

$$v^\top \Delta_N v = \frac{1}{2} \sum_{i,j=1}^{N} W_{ij} \big( v(x_i) - v(x_j) \big)^2, \qquad v \in \mathbb{R}^N, \qquad (6)$$

we readily see that $\Delta_N$ is positive semi-definite. Moreover, 6 implies that if we identify $v$ with a function over the features, then $v$'s that change slowly with respect to the similarities lead to smaller values of $v^\top \Delta_N v$. This observation suggests considering the Gaussian distribution $N(0, \Delta_N^{-1})$, since its negative log density is proportional to 6 (up to an additive constant) and therefore favors those “smooth” $v$'s. However $\Delta_N$ is singular, and the above Gaussian would be degenerate. To remedy this issue, and to further exploit the regularizing power of $\Delta_N$, we will instead consider

$$v_N \sim N\big( 0, (\tau I_N + \Delta_N)^{-s} \big), \qquad (7)$$

with $\tau, s > 0$ and $I_N$ the $N \times N$ identity matrix. Here we have two additional parameters $\tau$ and $s$ which enhance the modeling flexibility. Roughly speaking, $\tau$ and $s$ control, respectively, the inverse lengthscale and smoothness of the samples (interpreted as functions over the graph). To see this, we can write the Karhunen-Loève expansion of $v_N$ in 7

$$v_N = \sum_{i=1}^{N} (\tau + \lambda_i^N)^{-s/2} \, \xi_i \, \psi_i^N, \qquad \xi_i \overset{\text{i.i.d.}}{\sim} N(0, 1), \qquad (8)$$

where $(\lambda_i^N, \psi_i^N)$ are the eigenvalue-eigenvector pairs of $\Delta_N$ with increasingly ordered eigenvalues. The eigenvectors $\psi_i^N$ become more oscillatory as $i$ increases, and therefore a larger $s$—which implies faster decay of the coefficients $(\tau + \lambda_i^N)^{-s/2}$—yields more regular sample paths, whereas a larger $\tau$ incorporates more essential frequencies and makes the sample paths more oscillatory. Figures 2a, 2b and 2c demonstrate this behavior for three sets of parameters when the $x_i$'s are sampled from the unit circle. Finally, to further enhance the modeling flexibility, one can replace $\tau$ with a vector $\tau \in \mathbb{R}^N$ with positive entries. Doing so introduces a form of nonstationary local behavior, as shown in Figure 2d, where $\tau$ increases from 1 (the left end) to 30 (the right end).
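The following sketch (illustrative, assuming a symmetric similarity matrix W held as a dense array) assembles the unnormalized graph-Laplacian and draws samples of 7 through the Karhunen-Loève expansion 8.

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalized graph-Laplacian Delta_N = D - W, with D the diagonal degree matrix."""
    return np.diag(W.sum(axis=1)) - W

def sample_graph_matern(W, tau=1.0, s=2.0, n_samples=1, rng=None):
    """Draw samples v ~ N(0, (tau I + Delta_N)^{-s}) via the spectral expansion."""
    rng = np.random.default_rng() if rng is None else rng
    lam, psi = np.linalg.eigh(graph_laplacian(W))     # eigenpairs, eigenvalues in increasing order
    coeff = (tau + lam) ** (-s / 2.0)                 # spectral coefficients (tau + lambda_i)^{-s/2}
    xi = rng.standard_normal((len(lam), n_samples))   # i.i.d. standard normal coordinates
    return psi @ (coeff[:, None] * xi)                # each column is one sample of v_N
```

Combined with a similarity matrix such as the squared-exponential one sketched in Section 2, increasing s yields visibly smoother samples, while increasing tau makes them more oscillatory, mirroring the behavior described for Figure 2.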

Figure 2.

Plots of samples of 7 for different $\tau$'s and $s$'s when the $x_i$'s are sampled from the unit circle. The second row unfolds the plots in the first row to the interval $[0, 2\pi)$ for better visualization of the fluctuations.

(a)–(c) Samples for three fixed choices of $(\tau, s)$. (d) Sample with spatially varying $\tau$.

3.2. Connection with Matérn GP

Besides the regularizing effect of the graph-Laplacian described above, the Gaussian distribution 7 is also motivated by a close connection to Matérn GPs on Euclidean spaces. To start with, recall that the Matérn covariance function takes the following form

$$c(x, \tilde{x}) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \big( \kappa |x - \tilde{x}| \big)^{\nu} K_{\nu}\big( \kappa |x - \tilde{x}| \big), \qquad (9)$$

for $x, \tilde{x} \in \mathbb{R}^d$. Here $\Gamma$ is the gamma function and $K_{\nu}$ is the modified Bessel function of the second kind. The Matérn GP is a GP with the Matérn covariance function. It is a popular modeling choice in Bayesian methodology due to the flexibility offered by the three parameters $\sigma$, $\nu$, and $\kappa$, which control, respectively, the marginal variance, the sample path smoothness, and the correlation lengthscale (through its reciprocal $1/\kappa$). As we will see shortly, it turns out that we can view the finite dimensional Gaussian 7 as a discrete analog of the Matérn GP, where the parameters $\nu$ and $\kappa$ play similar roles as our $s$ and $\tau$.
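For reference, the Matérn covariance 9 can be evaluated directly with standard special functions; the snippet below is a generic transcription in the $(\sigma, \nu, \kappa)$ parametrization used above, not code from the article.

```python
import numpy as np
from scipy.special import gamma, kv

def matern_covariance(r, sigma=1.0, nu=1.5, kappa=1.0):
    """Matern covariance as a function of the distance r = |x - x'| (vectorized in r)."""
    r = np.atleast_1d(np.asarray(r, dtype=float))
    c = np.empty_like(r)
    c[r == 0] = sigma**2                               # limiting value as r -> 0
    rr = kappa * r[r > 0]
    c[r > 0] = sigma**2 * (2 ** (1 - nu) / gamma(nu)) * rr**nu * kv(nu, rr)
    return c
```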

The key connection is the stochastic partial differential equation (SPDE) representation of Matérn GPs proved by Whi63, which says that the Matérn GP is the unique stationary solution to

$$\big( \kappa^2 - \Delta \big)^{\frac{\nu}{2} + \frac{d}{4}} u = \mathcal{W}, \qquad (10)$$

where $\Delta$ is the usual Laplacian and $\mathcal{W}$ is a spatial white noise with unit variance. With this in mind, we can rewrite 7 in a similar fashion as

$$\big( \tau I_N + \Delta_N \big)^{s/2} v_N = \mathcal{W}_N, \qquad (11)$$

where $\mathcal{W}_N \sim N(0, I_N)$. Now, ignoring the marginal variance, one can immediately see 11 as a discrete analog of 10 under the relations $\tau = \kappa^2$ and $s = \nu + d/2$ (note that $\Delta_N$ plays the role of $-\Delta$). In other words, we can interpret 7 as a Matérn GP over the graph.
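The SPDE viewpoint 11 also suggests a standard way to sample the graph field without an eigendecomposition: for integer s, form the precision matrix $(\tau I_N + \Delta_N)^s$ and solve a triangular system obtained from its Cholesky factor. The sketch below is a generic implementation of this idea, not taken from the article.

```python
import numpy as np
from scipy.linalg import solve_triangular

def sample_from_precision(Delta_N, tau=1.0, s=2, n_samples=1, rng=None):
    """Draw v ~ N(0, (tau I + Delta_N)^{-s}) for integer s using a Cholesky factor of the precision."""
    rng = np.random.default_rng() if rng is None else rng
    N = Delta_N.shape[0]
    P = np.linalg.matrix_power(tau * np.eye(N) + Delta_N, s)   # precision matrix (tau I + Delta_N)^s
    R = np.linalg.cholesky(P)                                  # P = R R^T, with R lower triangular
    w = rng.standard_normal((N, n_samples))                    # discrete white noise, as in 11
    return solve_triangular(R.T, w, lower=False)               # v = R^{-T} w has covariance P^{-1}
```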

3.3. Prior continuum limit

If we impose certain assumptions on the graph, it can be shown that our graph Matérn GP is not only a discrete analog of the usual Matérn GP, but also a consistent approximation of certain continuum Matérn-type GPs. To formalize this statement, we rely on the manifold assumption that we had previously foreshadowed. Suppose now that the $x_i$'s are independently sampled from the uniform distribution on the manifold $\mathcal{M}$. We then have the following result (see GTSA18, Theorem 4.2 (1) and SAY20a, Theorem 4.2 for the formal version):

Result 3.1.

Under a manifold assumption, the graph Matérn GP 7 converges to a Matérn-type GP on $\mathcal{M}$, provided that the similarity matrix $W$ is suitably defined and the smoothness parameter $s$ is sufficiently large.

We next provide some further context for this result. First, the limiting Matérn-type GP on $\mathcal{M}$ is defined by

$$v \sim N\big( 0, (\tau I - \Delta_{\mathcal{M}})^{-s} \big), \qquad (12)$$

where $I$ is the identity and $\Delta_{\mathcal{M}}$ is the Laplace-Beltrami operator (the manifold analog of the usual Laplacian) on $\mathcal{M}$. By convention, $\Delta_{\mathcal{M}}$ is a negative semi-definite operator, which explains the minus sign. Just as the connection between 7 and the SPDE representation of Matérn GP, we can see 12 as a manifold analog of Matérn GP defined by lifting 10. In particular, we can write a similar series representation of 12

$$v = \sum_{i=1}^{\infty} (\tau + \lambda_i)^{-s/2} \, \xi_i \, \psi_i, \qquad \xi_i \overset{\text{i.i.d.}}{\sim} N(0, 1), \qquad (13)$$

in terms of the eigenpairs $(\lambda_i, \psi_i)$ of $-\Delta_{\mathcal{M}}$. The eigenfunctions $\psi_i$ encode rich information about the geometry of $\mathcal{M}$ and form a natural basis of functions over $\mathcal{M}$. Comparing 8 and 13, it is reasonable to expect that large $N$ convergence will hold provided that we have convergence of the corresponding eigenvalues and eigenfunctions. To achieve this, we need to carefully construct the similarity matrix $W$ so that the graph-Laplacian $\Delta_N$ is a good approximation of $-\Delta_{\mathcal{M}}$. If we assume that $\mathcal{M}$ is a $d$-dimensional compact submanifold of $\mathbb{R}^D$, then this is indeed the case if we set

$$W_{ij} = \frac{2(d+2)}{N \nu_d \varepsilon^{d+2}} \, \mathbf{1}\big\{ |x_i - x_j| < \varepsilon \big\}, \qquad (14)$$

where $\nu_d$ is the volume of the $d$-dimensional unit ball and $\varepsilon$ is a user-chosen graph connectivity parameter satisfying

$$\varepsilon \gg \frac{(\log N)^{p_d}}{N^{1/d}}, \qquad (15)$$

with $p_d = 3/4$ if $d = 2$ and $p_d = 1/d$ otherwise. Small values of $\varepsilon$ induce sparse graphs, which are easier to work and compute with; see Section 3.4 below. However, very small values of $\varepsilon$ render graphs that are so weakly connected that they cannot induce any level of smoothness in the functions that are likely to be generated by the prior. It is thus important to set the connectivity $\varepsilon$ appropriately in order to take advantage of sparsity while at the same time recovering the geometric information of $\mathcal{M}$. The specific lower bound in 15 characterizes the level of resolution of the implicit discretization of the manifold induced by the $x_i$'s. We require $\varepsilon$ to be larger than this quantity to capture the geometry of the underlying manifold. Under these conditions on $\varepsilon$, it can be shown that $\Delta_N$ converges spectrally towards $-\Delta_{\mathcal{M}}$. Other types of graphs such as $k$-nearest neighbors and variable bandwidth graphs can also be employed, and recent work GT19 has shown spectral convergence in these settings.
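A sketch of the ε-graph construction is given below; the normalizing constant simply mirrors the reconstruction in 14 above, and different references use slightly different constants, so it should be checked against the formal statements in GTSA18 and SAY20a.

```python
import numpy as np
from scipy.special import gamma

def unit_ball_volume(d):
    """Volume nu_d of the d-dimensional Euclidean unit ball."""
    return np.pi ** (d / 2) / gamma(d / 2 + 1)

def epsilon_graph_weights(X, eps, d):
    """Epsilon-graph weights with the rescaling of 14; X has one feature per row, d is the intrinsic dimension."""
    N = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = (sq_dists < eps**2).astype(float)
    np.fill_diagonal(W, 0.0)                      # no self-loops
    return 2 * (d + 2) / (N * unit_ball_volume(d) * eps ** (d + 2)) * W
```

Note that the lower bound 15 on eps involves the intrinsic dimension d of the manifold rather than the ambient dimension of the rows of X.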

3.4. Sparsity

So far we have discussed the construction of our prior from a modeling perspective, motivated by the regularizing power of the graph-Laplacian and the connection with usual Matérn GPs. We close our discussion of prior design by commenting on sparsity. Notice that with our choice of weights 14, the similarity matrix $W$—and hence the graph-Laplacian $\Delta_N$—are sparse. Indeed, one can show that for

$$\varepsilon \asymp \frac{(\log N)^{p_d}}{N^{1/d}},$$

the number of nonzero entries of $W$ is of order $N \log N$. Therefore, for small integer $s$ in 7, we are left with a Gaussian with sparse precision matrix $(\tau I_N + \Delta_N)^{s}$, and numerical linear algebra methods for sparse matrices can be employed to speed up computation. This is important for posterior inference algorithms that may require factorizing the precision matrix. Similar conclusions can be reached with $k$-nearest neighbors graphs.
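When W is sparse, the precision matrix $(\tau I_N + \Delta_N)^s$ can be assembled and stored in sparse format; the snippet below is a minimal sketch with scipy.sparse, again assuming an integer smoothness parameter s.

```python
import numpy as np
import scipy.sparse as sp

def sparse_precision(W_dense, tau=1.0, s=2):
    """Sparse precision matrix (tau I + Delta_N)^s for an integer smoothness parameter s."""
    W = sp.csr_matrix(W_dense)
    Delta_N = sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W   # unnormalized graph-Laplacian
    A = sp.identity(W.shape[0], format="csr") * tau + Delta_N
    P = sp.identity(W.shape[0], format="csr")
    for _ in range(s):                                          # (tau I + Delta_N)^s by repeated products
        P = P @ A
    return P.tocsc()                                            # P.nnz gives the number of stored nonzeros
```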

4. Posterior Continuum Limit

In this section we discuss the convergence of the posterior $\mu_N$ for large $N$ (and fixed $n$) towards a continuum posterior $\mu$ defined later. For now, it suffices to note that the continuum posterior $\mu$ is naturally characterized as a probability distribution over the space $L^2(\gamma)$ of functions on $\mathcal{M}$ that are square-integrable with respect to the marginal law $\gamma$ of the features. When formalizing a notion of convergence for posteriors, a challenge arises: the measures $\mu_N$ and $\mu$ are probability measures defined over different spaces, i.e., $\mathbb{R}^N$ and $L^2(\gamma)$, respectively. In what sense should these measures be compared? In what follows, we present a possible solution to this question, which also arises in the rigorous analysis of continuum limits for prior distributions considered in the previous section.

4.1. Lifting to the space $TL^2$

In order to compare the measures $\mu_N$ and $\mu$, we start by introducing a space where we can directly compare functions in $\mathbb{R}^N$ (identified with functions on the point cloud $\{x_1, \ldots, x_N\}$) with functions in $L^2(\gamma)$. We let $TL^2$ be the set:

$$TL^2 := \big\{ (\theta, f) : \ \theta \in \mathcal{P}(\mathcal{M}), \ f \in L^2(\theta) \big\}.$$

In words, $TL^2$ is the collection of pairs of the form $(\theta, f)$, where $\theta$ is a probability measure over $\mathcal{M}$ and $f$ is an element in $L^2(\theta)$. For us, the most important choices for $\theta$ are the empirical measure $\gamma_N := \frac{1}{N} \sum_{i=1}^{N} \delta_{x_i}$ associated to the samples and the data generating distribution $\gamma$. We use the simplified notation $L^2(\gamma_N)$ and $L^2(\gamma)$ to denote the spaces $L^2(\theta)$ for these two choices of $\theta$. $TL^2$ can be formally interpreted as a fiber bundle over the space $\mathcal{P}(\mathcal{M})$ of probability measures on the manifold $\mathcal{M}$: each $\theta$ possesses a corresponding fiber $L^2(\theta)$.

We endow $TL^2$ with the following distance:

$$d_{TL^2}\big( (\theta, f), (\tilde{\theta}, \tilde{f}) \big)^2 := \inf_{\rho \in \Gamma(\theta, \tilde{\theta})} \int_{\mathcal{M} \times \mathcal{M}} \Big( d_{\mathcal{M}}(x, \tilde{x})^2 + \big| f(x) - \tilde{f}(\tilde{x}) \big|^2 \Big) \, d\rho(x, \tilde{x}),$$

where $\Gamma(\theta, \tilde{\theta})$ represents the set of couplings between $\theta$ and $\tilde{\theta}$—that is, the set of probability measures on $\mathcal{M} \times \mathcal{M}$ whose first and second marginals are $\theta$ and $\tilde{\theta}$, respectively—and $d_{\mathcal{M}}$ denotes the geodesic distance in $\mathcal{M}$. It is possible to show that the metric $d_{TL^2}$ is, indeed, a distance function. Moreover, the topology induced by $d_{TL^2}$ in each fixed fiber $L^2(\theta)$ coincides with the natural topology of the Hilbert space $L^2(\theta)$, a fact that motivates the notation $TL^2$, which suggests an $L^2$-like convergence after transportation. We refer to TPK17 for further details.
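For two empirical measures with the same number of atoms and uniform weights, the infimum in the definition of $d_{TL^2}$ is attained at a coupling induced by a permutation, so the distance can be computed exactly with a linear assignment solver. The sketch below is purely illustrative and uses the Euclidean distance as a stand-in for the geodesic distance $d_{\mathcal{M}}$.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def tl2_distance_empirical(X, f, Y, g):
    """TL2 distance between (empirical measure of the rows of X, values f) and
    (empirical measure of the rows of Y, values g), both with N atoms and uniform weights."""
    N = X.shape[0]
    space_cost = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)   # Euclidean proxy for d_M^2
    value_cost = (np.asarray(f)[:, None] - np.asarray(g)[None, :]) ** 2
    cost = space_cost + value_cost
    rows, cols = linear_sum_assignment(cost)        # optimal permutation coupling
    return np.sqrt(cost[rows, cols].sum() / N)
```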

We proceed to define a notion of convergence for the posteriors $\mu_N$ as $N \to \infty$. As discussed above, the space $TL^2$ allows us to see $L^2(\gamma_N)$ and $L^2(\gamma)$ as subsets of the bigger common space $TL^2$. In turn, the prior measures $\pi_N$ and their continuum counterpart $\pi$, as well as the posteriors $\mu_N$ and $\mu$, can then all be interpreted as probability measures on the space $TL^2$. Using this “lifting” we can now interpret the statement $\mu_N \to \mu$ as $N \to \infty$ as a statement about the weak convergence of probability measures in the metric space $(TL^2, d_{TL^2})$. Further properties of the space $TL^2$ allow us to use a collection of theorems, such as Portmanteau's and Prokhorov's, to characterize convergence and compactness in the space $TL^2$.

After specifying the notion of convergence of $\mu_N$ towards $\mu$, we can now present a result, rigorously stated in GTSA18.

Result 4.1.

Under a manifold assumption, the graph-based posterior $\mu_N$ converges to a continuum limit posterior $\mu$ over functions on $\mathcal{M}$, provided that the similarity matrix $W$ is suitably defined and the smoothness parameter $s$ is sufficiently large.

Further context for this result will be given next.

4.2. Convergence of posteriors

Now that we have discussed the precise way in which we formalize the convergence of $\mu_N$ towards $\mu$, we proceed to characterize $\mu$ and describe the tools used to deduce this convergence. For ease of exposition, we focus on the regression setting.

First, we notice that the posterior distribution $\mu_N$, introduced in Section 2 via Bayes's formula, can be characterized variationally. Indeed, $\mu_N$ is the solution to the optimization problem

$$\mu_N = \operatorname*{arg\,min}_{\nu} \ \Big\{ D_{\mathrm{KL}}(\nu \,\|\, \pi_N) + \int \Phi_N(u; y) \, \nu(du) \Big\},$$

where, for $u \in \mathbb{R}^N$,