# Machine Learning and Invariant Theory

Ben Blum-Smith

## 1. Introduction

Modern machine learning has not only surpassed the state of the art in many engineering and scientific problems, but it also has had an impact on society at large, and will likely continue to do so. This includes deep learning, large language models, diffusion models, etc. In this article, we give an account of certain mathematical principles that are used in the definition of some of these machine learning models, and we explain how classical invariant theory plays a role in them. Due to space constraints we leave out many relevant references. A version of this manuscript with a longer set of references is available on arXiv 1.

In supervised machine learning, we typically have a training set , where are the data points and are the labels. A typical example is image recognition, where the are images and the are image labels (say, “cat” or “dog”), encoded as vectors. The goal is to find a function in a hypothesis space , that not only approximately interpolates the training data (), but also performs well on unseen (or held-out) data. The function is called the trained model, predictor, or estimator. In practice, one parametrizes the class of functions with some parameters varying over a space of parameter values sitting inside some ; in other words, . Then one uses local optimization (in ) to find a function in that locally and approximately minimizes a prespecified empirical loss function which compares a candidate function ’s values on the with the “true” target values . In other words, one approximately solves , and then takes .

Modern machine learning performs regressions on classes of functions that are typically overparameterized (the dimension of the space of parameters is much larger than the number of training samples), and in many cases, several functions in the hypothesis class can interpolate the data perfectly. Deep learning models can even interpolate plain noise, or fit images to random labels. Moreover, the optimization problem is typically nonconvex. Therefore the model performance is highly dependent on how the class of functions is parameterized and the optimization algorithms employed.

The parameterization of the hypothesis class of functions is what in deep learning is typically referred to as the architecture. In recent years, the most successful architectures have been ones that use properties or heuristics regarding the structure of the data (and the problem) to design the class of functions: convolutional neural networks for images, recurrent neural networks for time series, graph neural networks for graph-structured data, transformers, etc. Many of these design choices are related to the symmetries of the problem: for instance, convolutional neural networks can be translation equivariant, and transformers can be permutation invariant.

When the learning problem comes from the physical sciences, there are concrete sets of rules that the function being modeled must obey, and these rules often entail symmetries. The rules (and symmetries) typically come from coordinate freedoms and conservation laws 19. One classical example of these coordinate freedoms is the scaling symmetry that comes from dimensional analysis (for instance, if the input data to the model is rescaled to change everything that has units of kilograms to pounds, the predictions should scale accordingly). In order to do machine learning on physical systems, researchers have designed models that are consistent with physical law; this is the case for physics-informed machine learning, neural ODEs and PDEs, and equivariant machine learning.

Given data spaces and a group acting on both of them, a function is equivariant if for all and all . Many physical problems are equivariant with respect to rotations, permutations, or scalings. For instance, consider a problem where one uses data to predict the dynamics of a folding protein or uses simulated data to emulate the dynamics of a turbulent fluid. Equivariant machine learning restricts the hypothesis space to a class of equivariant functions. The philosophy is that every function that the machine learning model can express is equivariant, and therefore consistent with physical law.

Symmetries were used for machine learning (and in particular neural networks) in early works 16, and more recently they have been revisited in the context of deep learning. There are three main ways to implement symmetries. The simplest one parameterizes the invariant and equivariant functions with respect to discrete groups by averaging arbitrary functions over the group orbit 3. The second approach, explained in the next section, uses classical representation theory to parameterize the space of equivariant functions (see for instance 8). The third approach, the main point of this article, uses invariant theory.

As an example, we briefly discuss graph neural networks (GNNs), which have been a very popular area of research in the past couple of years. GNNs can be seen as equivariant functions that take a graph represented by its adjacency matrix and possible node features , and output an embedding so that for all permutation matrices. Graph neural networks are typically implemented as variants of graph convolutions or message passing, which are equivariant by definition. However, many equivariant functions cannot be expressed with these architectures. Several recent works analyze the expressive power of different GNN architectures in connection to the graph isomorphism problem.

Beyond graphs, equivariant machine learning models have been extremely successful at predicting molecular structures and dynamics, protein folding, protein binding, and simulating turbulence and climate effects, to name a few applications. Theoretical developments have shown the universality of certain equivariant models, as well as generalization improvements of equivariant machine learning models over nonequivariant baselines. There has been some recent work studying the inductive bias of equivariant machine learning, and its relationship with data augmentation. See 1 for a list of references on these topics.

## 2. Equivariant Convolutions and Multilayer Perceptrons

Modern deep learning models have evolved from the classical artificial neural network known as the perceptron. The multilayer perceptron model takes an input and outputs defined to be the composition of affine linear maps and nonlinear entry-wise functions. Namely,

where is the (fixed) entrywise nonlinear function and are affine linear maps to be learned from the data. The linear maps can be expressed as where and . In this example each function is defined by the parameters .

The first neural network that was explicitly equivariant with respect to a group action is the convolutional neural network. The observation is that if the input images are seen in the torus , the linear equivariant maps are cross-correlations (which in ML are referred to as convolutions) with fixed filters. The idea of restricting the linear maps to satisfy symmetry constraints was generalized to equivariance with respect to discrete rotations and translations, and to general homogenous spaces. Note that when working with arbitrary groups there are restrictions on the functions for the model to be equivariant.

Classical results in neural networks show that certain multilayer perceptrons can universally approximate any continuous function in the limit where the number of neurons goes to infinity. However, that is not true in general in the equivariant case. Namely, functions expressed as 1 where the are linear and equivariant may not universally approximate all continuous equivariant functions. In some cases, there may not even exist nontrivial linear equivariant maps.

One popular idea to address this issue is to extend this model to use equivariant linear maps on tensors. Now are linear equivariant maps (where the action in the tensor product is defined as the tensor product of the action in each component and extended linearly). Now the question is how can we parameterize the space of such functions to do machine learning? The answer is via Schur’s lemma.

A representation of a group is a map that satisfies (where is a vector space and , as usual, denotes the automorphisms of , that is, invertible linear maps ). A group action of on (written as ) is equivalent to the group representation such that . We extend the action to the tensor product so that the group acts independently in every tensor factor (i.e., in every dimension or mode), namely .

The first step is to note that a linear equivariant map corresponds to a map between group representations such that for all . Homomorphisms between group representations are easily parametrizable if we decompose the representations in terms of irreducible representations (aka irreps):

In particular, Schur’s Lemma says that a map between two irreps over is zero (if they are not isomorphic) or a multiple of the identity (if they are).

The equivariant neural-network approach consists in decomposing the group representations in terms of irreps and explicitly parameterizing the maps 9. In general, it is not obvious how to decompose an arbitrary group representation into irreps. However in the case where , the decomposition of a tensor representation as a sum of irreps is given by the Clebsch–Gordan decomposition:

The Clebsch–Gordan decomposition not only gives the decomposition of the right side of 3 but also it gives the explicit change of coordinates. This decomposition is fundamental for implementing the equivariant 3D point-cloud methods defined in 8 and other works (see references in 1). Moreover, recent work 6 shows that the classes of functions used in practice are universal, meaning that every continuous SO(3)-equivariant function can be approximated uniformly in compact sets by those neural networks. However, there exists a clear limitation to this approach: Even though decompositions into irreps are broadly studied in mathematics (plethysm), the explicit transformation that allows us to write the decomposition of tensor representations into irreps is a hard problem in general. It is called the Clebsch–Gordan problem.

## 3. Invariant Theory for Machine Learning

An alternative but related approach to the linear equivariant layers described above is the approach based on invariant theory, the focus of this article. In particular, the authors of this note and collaborators 18 explain that for some physically relevant groups—the orthogonal group, the special orthogonal group, and the Lorentz group—one can use classical invariant theory to design universally expressive equivariant machine learning models that are expressed in terms of the generators of the algebra of invariant polynomials. Following an idea attributed to B. Malgrange (that we learned from G. Schwarz), it is shown how to use the generators of the algebra of invariant polynomials to produce a parameterization of equivariant functions for a specific set of groups and actions.

To illustrate, let us focus on -equivariant functions, namely functions such that for all and all (for instance, the prediction of the position and velocity of the center of mass of a particle system). The method of B. Malgrange (explicated below) leads to the conclusion that all such functions can be expressed as

where are -invariant functions. Classical invariant theory shows that is -invariant if and only if it is a function of the pairwise inner products . So, in actuality, 4 can be rewritten

where are arbitrary functions. In other words, the pairwise inner products generate the algebra of invariant polynomials for this action, and every equivariant map is a linear combination of the vectors themselves with coefficients in this algebra.

In this article, we explicate the method of B. Malgrange in full generality, showing how to convert knowledge of the algebra of invariant polynomials into a characterization of the equivariant polynomial (or smooth) maps. In Section 4 we explain the general philosophy of the method, and in Section 5 we give the precise algebraic development, formulated as an algorithm to produce parametrizations of equivariant maps given adequate knowledge of the underlying invariant theory. In Section 6, we work through several examples.

We note that, for machine learning purposes, it is not critical that the functions are defined on invariant polynomials, nor that they themselves are polynomials. In the following, we focus on polynomials because the ideas were developed in the context of invariant theory; the arguments explicated below are set in this classical context. However, in 18, the idea is to preprocess the data by converting the tuple to the tuple of dot products , and then treat the latter as input to the ’s, which are then learned using a machine learning architecture of one’s choice. Therefore, the ’s are not polynomials but belong to whatever function class is output by the chosen architecture. Meanwhile, some recent works 2511 have proposed alternative classes of separating invariants that can be used in place of the classical algebra generators as input to the ’s, and may have better numerical stability properties. This is a promising research direction.

## 4. Big Picture

We are given a group , and finite-dimensional linear -representations and over a field . (We can take or .) We want to understand the equivariant polynomial maps . We assume we have a way to understand -invariant polynomials on spaces related to and , and the goal is to leverage that knowledge to understand the equivariant maps.

The following is a philosophical discussion, essentially to answer the question: why should it be possible to do this? It is not precise; its purpose is just to guide thinking. Below in Section 5 we show how to actually compute the equivariant polynomials given adequate knowledge of the invariants. That section is rigorous.

The first observation is that any reasonable family of maps (for example linear, polynomial, smooth, continuous, etc.) has a natural -action induced from the actions on and , and that the -equivariant maps in such a family are precisely the fixed points of this action, as we now explain. This observation is a standard insight in representation and invariant theory.

Let be the set of maps of whatever kind, and let (respectively ) be the group of linear invertible maps from (respectively ) to itself. Given and and , we define the map by

where and are the group homomorphisms defining the representations and . The algebraic manipulation to verify that this is really a group action is routine and not that illuminating. A perhaps more transparent way to understand this definition of the action as “the right one” is that it is precisely the formula needed to make this square commute:

It follows from the definition of this action that the condition is equivalent to the statement that is -equivariant. The square above automatically commutes, so is the same as saying that the below square commutes—

—and this is what it means to be equivariant.

An important special case of 6 is the action of on , the linear dual of . This is the case with trivial action, and for 6 reduces to . This is known as the contragredient action. We will utilize it momentarily with in the place of .

The second observation is that can be identified with functions from a bigger space to the underlying field by “currying,” and this change in point of view preserves the group action. Again, this is a standard maneuver in algebra. Specifically, given any map , we obtain a function , defined by the formula

Note that the function is linear homogeneous in . Conversely, given any function that is linear homogeneous in the second coordinate, we can recover a map such that , by taking to be the element of identified along the canonical isomorphism with the functional on that sends to —this functional is guaranteed to exist by the fact that is linear homogeneous in the second coordinate. An observation we will exploit in the next section is that the desired functional is actually the gradient of with respect to .

This construction gives an identification of with a subset of . Furthermore, there is a natural action of on , defined precisely by the above formula 6 with in place of , in place of , and trivial action on ;⁠Footnote1 and the identification described here preserves this action. Therefore, the fixed points for the -action on correspond with fixed points for the -action on , which are invariant functions (since the action of on is trivial).

1

The action of on is defined by acting separately on each factor; the action on is the contragredient representation defined above.

What has been achieved is the reinterpretation of equivariant maps first as fixed points of a -action, and then as invariant functions . Thus, knowledge of invariant functions can be parlayed into knowledge of equivariant maps.

## 5. Equivariance from Invariants

With the above imprecise philosophical discussion as a guide, Algorithm 1 shows how in practice to get from a description of invariant polynomials on , to equivariant polynomial (or smooth) maps . The technique given here is attributed to B. Malgrange; see 13, Proposition 2.2 where it is used to obtain the smooth equivariant maps, and 15, Proposition 6.8 where it is used to obtain holomorphic equivariant maps. Variants on this method are used to compute equivariant maps in 7, Sections 2.1–2.2, 12, Section 3.12, 4, Section 4.2.3, and 20, Section 4.

The goal of the algorithm is to provide a parametrization of equivariant maps. That said, the proof of correctness is constructive: as an ancillary benefit, it furnishes a method for taking an arbitrary equivariant map given by explicit polynomial expressions for the coordinates and expressing it in terms of this parametrization.

We now exposit in detail Algorithm 1 and its proof of correctness, in the case where is a polynomial map; for simplicity we take . The argument is similar for smooth or holomorphic maps, except that one needs an additional theorem to arrive at the expression 7 below. If is a compact Lie group, the needed theorem is proven in 14 for smooth maps, and in 10 for holomorphic maps over .

We begin with linear representations and of a group over . We take to be the contragredient representation to , defined above. (If is compact, we can work in a coordinate system in which the action of on is orthogonal, and then we may ignore the distinction between and , as discussed above in the case of .) We suppose we have an explicit set of polynomials that generate the algebra of invariant polynomials on the vector space (denoted as )—in other words, they have the property that any invariant polynomial can be written as a polynomial in these. We also assume they are bihomogeneous, i.e., independently homogeneous in and in . To reduce notational clutter we suppress the maps specifying the actions of on and (which were called and in the previous section), writing the image of (respectively , ) under the action of an element as (respectively , ). We suppose are degree in (so they are functions of alone), are degree in , and are degree in .

Now we consider an arbitrary -equivariant polynomial function

We let be arbitrary, and, as in the previous section, we construct the function

Equivariance of implies that is invariant:

From the invariance of , and the fact that generate the algebra of invariant polynomials on , we have an equality of the form

where is a polynomial. Note that do not depend on , while do.

We now fix and take the gradient of both sides of 7 with respect to , viewed as an element of .⁠Footnote2 Choosing dual bases for and for , and writing , we can express the operator acting on a smooth function explicitly by the formula

2

In the background, we are using canonical isomorphisms to identify with all its tangent spaces, and with .

Applying to the left side of 7, we get

so recovers from . (Indeed, this was the point.) Meanwhile, applying to the right side of 7, writing for the partial derivative of with respect to its th argument, and using the chain rule, we get

Combining these, we conclude

Now we observe that if , because in those cases is constant with respect to . But meanwhile, the left side of 8 does not depend on , and it follows the right side does not either; thus we can evaluate it at our favorite choice of ; we take . Upon doing this, also becomes for , because in these cases is homogeneous of degree at least in , so its partial derivatives with respect to the remain homogeneous degree at least in , thus they vanish at . Meanwhile, itself vanishes for , so that the st to th arguments of each vanish. Abbreviating

as