# What Is an Equivariant Neural Network?

Lek-Heng Lim

Communicated by Notices Associate Editor Emilie Purvine

We explain equivariant neural networks, a notion underlying breakthroughs in machine learning from deep convolutional neural networks for computer vision [KSH12] to AlphaFold 2 for protein structure prediction [JEP21], without assuming knowledge of equivariance or neural networks. The basic mathematical ideas are simple but are often obscured by engineering complications that come with practical realizations. We extract and focus on the mathematical aspects, and limit ourselves to a cursory treatment of the engineering issues at the end. We also include some material with machine learning practitioners in mind.

Let $X$ and $Y$ be sets, and $f \colon X \to Y$ a function. If a group $G$ acts on both $X$ and $Y$, and this action commutes with the function $f$:

$$f(g \cdot x) = g \cdot f(x) \quad \text{for all } x \in X, \; g \in G, \tag{1}$$

then we say that $f$ is $G$-equivariant. The special case where $G$ acts trivially on $Y$, i.e., $f(g \cdot x) = f(x)$, is called $G$-invariant. Linear equivariant maps are well studied in representation theory and continuous equivariant maps are well studied in topology. The novelty of equivariant neural networks is that they are usually nonlinear and sometimes discontinuous, even when $X$ and $Y$ are vector spaces and the actions of $G$ are linear.

Equivariance is ubiquitous in applications where symmetries in the input space $X$ produce symmetries in the output space $Y$. We consider a simple example. An image may be regarded as a function $f \colon \mathbb{R}^2 \to \mathbb{R}^3$, with each pixel $(x, y) \in \mathbb{R}^2$ assigned some RGB color $f(x, y) \in [0, 1]^3$. A simplifying assumption here is that pixels and colors can take values in a continuum. Let $V$ be the set of all images. Let the group $G = \{1, g\} \cong \mathbb{Z}/2\mathbb{Z}$ act on $V$ via top-bottom reflection, i.e., $g \cdot f$ is the image whose value at $(x, y)$ is $f(x, -y)$. Let $\sigma \colon [0,1]^3 \to [0,1]^3$ be a thresholding map, say,

$$\sigma(c) = \begin{cases} (0, 0, 0) & \text{if } c_1 + c_2 + c_3 \le 3/2, \\ (1, 1, 1) & \text{if } c_1 + c_2 + c_3 > 3/2. \end{cases}$$

Here $(0, 0, 0)$ and $(1, 1, 1)$ are the RGB encodings for pitch black and pure white respectively. So the map $\sigma_* \colon V \to V$, $\sigma_*(f) = \sigma \circ f$, transforms a color image into a black-and-white image. It does not matter whether we do a top-bottom reflection first or remove color first; the result is always the same, i.e., $\sigma_*(g \cdot f) = g \cdot \sigma_*(f)$ for all $f \in V$. Hence the decoloring map $\sigma_*$ is $G$-equivariant.
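This commutation is easy to verify numerically. Below is a minimal NumPy sketch; the image, the threshold, and the function names are illustrative choices, not anything prescribed by the text:

```python
import numpy as np

# A toy "image": a height x width x 3 array of RGB values in [0, 1].
rng = np.random.default_rng(0)
image = rng.random((4, 6, 3))

def decolor(img):
    """Pointwise thresholding: send each pixel to pitch black (0,0,0)
    or pure white (1,1,1) according to its total intensity."""
    bright = img.sum(axis=-1) > 1.5
    return np.repeat(bright[..., None], 3, axis=-1).astype(float)

def flip_tb(img):
    """Top-bottom reflection: the group action g . f on images."""
    return img[::-1, :, :]

# Equivariance: decoloring then flipping equals flipping then decoloring.
assert np.array_equal(decolor(flip_tb(image)), flip_tb(decolor(image)))
```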

Our choice of an image with left-right symmetry presents another opportunity to illustrate the notion. If we choose coordinates so that the vertical axis passes through the center of the butterfly image, then as a function $f \colon \mathbb{R}^2 \to \mathbb{R}^3$, it is invariant under the action of $\mathbb{Z}/2\mathbb{Z}$ on $\mathbb{R}^2$ via $(x, y) \mapsto (-x, y)$, i.e., $f(-x, y) = f(x, y)$. Note that the $G$-equivariance of the decoloring map has nothing to do with this.

A takeaway of these examples is that nonlinear and discontinuous functions may very well be equivariant. However, the best-known context for discussing equivariant maps is when $\Phi$ is an intertwining operator, i.e., a linear map between vector spaces $U$ and $V$ equipped with a linear action of $G$. In this case, an equivalent formulation of $G$-equivariance takes the following form: Given linear representations of $G$ on $U$ and $V$, i.e., homomorphisms $\rho_U \colon G \to \mathrm{GL}(U)$ and $\rho_V \colon G \to \mathrm{GL}(V)$, a linear map $\Phi \colon U \to V$ is said to be $G$-equivariant if

$$\Phi \circ \rho_U(g) = \rho_V(g) \circ \Phi \quad \text{for all } g \in G.$$

Intertwining operators preserve eigenvalues and, when $G$ is a Lie group, the action of its Lie algebra, properties that are crucial to their use in physics [BH10].

Nevertheless the restriction to linear maps is unnecessary. The de Rham problem asks: if $U = V = \mathbb{R}^n$ and $\Phi$ is merely required to be a homeomorphism, then does condition 1 imply that $\Phi$ must be a linear map? De Rham conjectured this to be the case but it was disproved in [CS81], launching a fruitful study of nonlinear similarity, i.e., nonlinear homeomorphisms $\Phi$ with

$$\Phi \circ \rho_U(g) = \rho_V(g) \circ \Phi \quad \text{for all } g \in G,$$

in algebraic topology and algebraic K-theory. More generally, equivariant continuous maps under continuous group actions have been thoroughly studied in equivariant topology [May96].

An equivariant neural network [CW16] is an equivariant map constructed by alternately composing equivariant linear maps with nonlinear ones like the decoloring map above. That neural networks can be readily made equivariant is a consequence of two straightforward observations:

(i) the composition of two $G$-equivariant functions $f_1 \colon \mathbb{R}^n \to \mathbb{R}^n$, $f_2 \colon \mathbb{R}^n \to \mathbb{R}^n$ is $G$-equivariant;

(ii) the linear combination of two $G$-equivariant functions is $G$-equivariant;

even when $f_1$, $f_2$ are nonlinear. Although an equivariant neural network is nonlinear, it uses intertwining operators as building blocks, and 1 plays a key role. In some applications like the wave function $\psi$ below, the input or possibly some hidden layers may not be vector spaces; for simplicity we assume that they are and their $G$-actions are linear.
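A quick numerical check of (i) and (ii), using the cyclic group acting on $\mathbb{R}^5$ by coordinate shifts; the two equivariant maps below, a pointwise ReLU and a shift-commuting circulant matrix, are illustrative:

```python
import numpy as np

# Cyclic-shift action of Z/5Z on R^5.
def shift(x):
    return np.roll(x, 1)

# Two equivariant maps: a pointwise nonlinearity and a circulant
# (hence shift-commuting) linear map.
relu = lambda x: np.maximum(x, 0)
C = sum(c * np.roll(np.eye(5), k, axis=0)
        for k, c in enumerate([1.0, -2.0, 0.0, 3.0, 1.0]))
lin = lambda x: C @ x

rng = np.random.default_rng(1)
x = rng.standard_normal(5)

# (i) the composition of equivariant maps is equivariant
f = lambda x: relu(lin(x))
assert np.allclose(f(shift(x)), shift(f(x)))

# (ii) a linear combination of equivariant maps is equivariant
h = lambda x: 2 * relu(x) - 3 * lin(x)
assert np.allclose(h(shift(x)), shift(h(x)))
```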

In machine learning applications, the map $f$ is learned from data. A major advantage of requiring equivariance in a neural network is that it allows one to greatly narrow down the search space for the parameters that define $f$. To demonstrate this, we begin with a simplified case that avoids group representations. A feed-forward neural network is a function $f \colon \mathbb{R}^n \to \mathbb{R}^n$ obtained by alternately composing affine maps $A_i \colon \mathbb{R}^n \to \mathbb{R}^n$, $i = 1, \dots, L$, with a nonlinear function $\sigma \colon \mathbb{R} \to \mathbb{R}$, i.e.,

$$f = A_L \circ \sigma \circ A_{L-1} \circ \cdots \circ \sigma \circ A_2 \circ \sigma \circ A_1,$$

giving $f \colon \mathbb{R}^n \to \mathbb{R}^n$. The depth, also known as the number of layers, is $L$ and the width, also known as the number of neurons, is $n$. The simplifying assumption, which will be dropped later, is that our neural network has constant width $n$ throughout all layers. The nonlinear function $\sigma$ is called an activation, with the ReLU (rectified linear unit) function $\sigma(x) = \max\{x, 0\}$ a standard choice. In a slight abuse of notation, the activation is extended to vector inputs by evaluating coordinatewise:

$$\sigma(x_1, \dots, x_n) = (\sigma(x_1), \dots, \sigma(x_n)). \tag{3}$$

In this sense, $\sigma$ is called a pointwise nonlinearity. The affine function is defined by $A_i(x) = W_i x + b_i$ for some $W_i \in \mathbb{R}^{n \times n}$ called the weight matrix and some $b_i \in \mathbb{R}^n$ called the bias vector. We do not include a bias in the last layer.
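The alternating composition above can be sketched in a few lines of NumPy; the width, depth, and random parameters are illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def feed_forward(x, weights, biases):
    """f = A_L . sigma . A_{L-1} ... sigma . A_1 with A_i(x) = W_i x + b_i,
    and no bias in the last layer."""
    *hidden, last = weights
    for W, b in zip(hidden, biases):
        x = relu(W @ x + b)
    return last @ x

# A depth-3, width-4 network with random (illustrative) parameters.
rng = np.random.default_rng(0)
n, L = 4, 3
weights = [rng.standard_normal((n, n)) for _ in range(L)]
biases = [rng.standard_normal(n) for _ in range(L - 1)]
y = feed_forward(rng.standard_normal(n), weights, biases)
print(y.shape)  # (4,)
```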

Although convenient, it is somewhat misguided to lump the bias and weight together in an affine function. Each bias is intended to serve as a threshold for the activation and should be part of it, detached from the weight that transforms the input. If one would like to incorporate translations, one may do so by going up one dimension, observing that

$$\begin{bmatrix} W & b \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ 1 \end{bmatrix} = \begin{bmatrix} Wx + b \\ 1 \end{bmatrix}.$$

Hence, a better, but mathematically equivalent, description of $f$ would be as the composition

$$f = W_L \circ \sigma_{b_{L-1}} \circ W_{L-1} \circ \cdots \circ \sigma_{b_1} \circ W_1,$$

where we identify $W_i \in \mathbb{R}^{n \times n}$ with the linear operator $W_i \colon \mathbb{R}^n \to \mathbb{R}^n$, $x \mapsto W_i x$, and for any $b \in \mathbb{R}^n$ we define $\sigma_b \colon \mathbb{R}^n \to \mathbb{R}^n$ by $\sigma_b(x) = \sigma(x + b)$. We will drop the composition symbol to avoid clutter and write

$$f = W_L \sigma_{b_{L-1}} W_{L-1} \cdots \sigma_{b_1} W_1$$

as if it were a product of matrices. For example, with the aforementioned ReLU as $\sigma$,

$$\sigma_b(x) = \max\{x + b, 0\},$$

and $-b$ plays the role of a threshold for activation as was intended in [Ros58, p. 392] and [MP43, p. 120].
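As a small illustration of the bias-as-threshold view (values chosen arbitrarily):

```python
import numpy as np

def sigma_b(x, b):
    """Shifted ReLU: sigma_b(x) = max(x + b, 0), so coordinate i
    activates precisely when x_i exceeds the threshold -b_i."""
    return np.maximum(x + b, 0)

b = np.array([-1.0, 0.0, 2.0])   # thresholds 1, 0, -2
x = np.array([0.5, 0.5, 0.5])
print(sigma_b(x, b))             # [0.  0.5 2.5]
```

Only the first coordinate fails to clear its threshold, so only it is zeroed out.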

A major computational issue with neural networks is the large number of unknown parameters, namely the entries of the weights $W_i$ and biases $b_i$, that have to be fit with data, especially for wide neural networks where $n$ is large. To get an idea of the numbers involved in realistic situations, $n$ may be on the order of millions of pixels for image-based tasks, whereas the depth $L$ typically runs from tens to hundreds of layers. Computational cost aside, one may not have enough data to fit so many parameters. Thus, many successful applications of neural networks require that we identify, based on the problem at hand, an appropriate low-dimensional subset of $\mathbb{R}^{n \times n}$ from which we will find our weights $W_i$. For example, for a signal processing problem, we might restrict $W_i$ to be Toeplitz matrices; the convolutional neural networks for image recognition in [KSH12], an article that launched the deep learning revolution, essentially restrict $W_i$ to so-called block-Toeplitz-Toeplitz-block or BTTB matrices. For 1D inputs with a single channel, i.e., inputs from $\mathbb{R}^n$, a general weight matrix requires $n^2$ parameters, whereas a Toeplitz one just needs $2n - 1$ parameters and a $k$-banded Toeplitz one requires just $2k + 1$. For 2D inputs with three channels such as color images, i.e., inputs from $\mathbb{R}^{n \times n \times 3}$, a general weight matrix would have required a staggering $9n^4$ parameters, whereas a BTTB one just needs $9(2n - 1)^2$, and a $k \times k$-local BTTB one requires just $9k^2$. These are all simplified versions of the convolutional layers in a convolutional neural network, which are a quintessential example of equivariant neural networks [CW16], and in fact every equivariant neural network may be regarded as a generalized convolutional neural network in an appropriate sense [KT18].
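The parameter counts above are easy to tabulate; the helper functions below are illustrative bookkeeping, not from the text:

```python
# Parameter counts for weight matrices under the structural constraints
# discussed in the text.

def param_counts_1d(n, k):
    """1D input, single channel: inputs from R^n."""
    general = n * n           # dense matrix
    toeplitz = 2 * n - 1      # constant along each diagonal
    banded = 2 * k + 1        # k-banded Toeplitz: a local filter
    return general, toeplitz, banded

def param_counts_2d(n, k, channels=3):
    """2D input with channels: inputs from R^{n x n x channels}."""
    general = (channels * n * n) ** 2            # dense matrix
    bttb = channels ** 2 * (2 * n - 1) ** 2      # one BTTB block per channel pair
    local = channels ** 2 * k * k                # k x k local filters
    return general, bttb, local

print(param_counts_1d(1000, 3))   # (1000000, 1999, 7)
```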

To see how equivariance naturally restricts the range of possible weights, let $G \subseteq \mathrm{GL}(n, \mathbb{R})$ be a matrix group. Then $f$ is $G$-equivariant if

$$f(gx) = g f(x) \quad \text{for all } x \in \mathbb{R}^n, \; g \in G, \tag{4}$$

and an equivariant neural network is simply a feed-forward neural network $f$ that satisfies 4. The key to its construction is just that

$$f(gx) = W_L \sigma_{b_{L-1}} W_{L-1} \cdots \sigma_{b_1} W_1 g x,$$

and the last expression equals $g f(x)$ if we have

$$W_i g = g W_i \quad \text{and} \quad \sigma_b(gx) = g \sigma_b(x) \tag{5}$$

for all $g \in G$, $x \in \mathbb{R}^n$, and for all $i = 1, \dots, L$, $b = b_1, \dots, b_{L-1}$. The condition on the right is satisfied by any pointwise nonlinearity that takes the form in 3, i.e., has all coordinates equal to some $\sigma \colon \mathbb{R} \to \mathbb{R}$; we will elaborate on this later. The condition on the left limits the possible weights for $f$ to a (generally) much smaller subspace of matrices that commute with all elements of $G$. Finding this subspace (in fact a subalgebra) of intertwining operators,

$$\{ W \in \mathbb{R}^{n \times n} : Wg = gW \text{ for all } g \in G \}, \tag{6}$$

is a well-studied problem in group representation theory; a general-purpose approach is to compute the null space of a matrix built from the generators of $G$ and, if $G$ is continuous, its Lie algebra [FWW21]. We caution the reader that $G$ will generally be a very low-dimensional subset of $\mathrm{GL}(n, \mathbb{R})$, as will become obvious from our example below in 8. It will be pointless to pick, say, $G = \mathrm{GL}(n, \mathbb{R})$, as the set in 6 will then be just $\{\lambda I : \lambda \in \mathbb{R}\}$, clearly too small to serve as meaningful weights for any neural network. Indeed, $G$ will usually be a homomorphic image $\rho(H)$ of a representation $\rho \colon H \to \mathrm{GL}(n, \mathbb{R})$ of some abstract group $H$, i.e., the image $\rho(H)$ will play the role of $G$ in 6. In any case, we will need to bring in group representations to address a different issue.
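Here is a minimal sketch of the null-space approach for a finite matrix group, using the vectorization identity $\operatorname{vec}(Wg - gW) = (I \otimes g^{\mathsf{T}} - g \otimes I)\operatorname{vec}(W)$ for row-major vectorization; the example group, the cyclic group acting on $\mathbb{R}^4$ by shifts, has the circulant matrices as its commutant:

```python
import numpy as np

def commutant_basis(generators, n, tol=1e-9):
    """Basis of {W in R^{n x n} : W g = g W for all generators g},
    computed as the null space of stacked Kronecker-product matrices."""
    I = np.eye(n)
    M = np.vstack([np.kron(I, g.T) - np.kron(g, I) for g in generators])
    _, s, Vt = np.linalg.svd(M)
    rank = int(np.sum(s > tol))
    return [v.reshape(n, n) for v in Vt[rank:]]  # row-major un-vectorize

# The cyclic shift generates C_4 acting on R^4; its commutant is the
# 4-dimensional algebra of circulant matrices.
P = np.roll(np.eye(4), 1, axis=0)
basis = commutant_basis([P], 4)
print(len(basis))  # 4
assert all(np.allclose(W @ P, P @ W) for W in basis)
```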

In general, neural networks have different widths in each layer:

$$f = W_L \sigma_{b_{L-1}} W_{L-1} \cdots \sigma_{b_1} W_1,$$

with $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$, $b_i \in \mathbb{R}^{n_i}$, $i = 1, \dots, L$, $f \colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$. The simplified case treated above assumes that $n_0 = n_1 = \dots = n_L = n$. It is easy to accommodate this slight complication by introducing group representations to equip every layer with its own homomorphic copy of $G$. Instead of fixing $G$ to be some subgroup of $\mathrm{GL}(n, \mathbb{R})$, $G$ may now be any abstract group but we introduce a homomorphism

$$\rho_i \colon G \to \mathrm{GL}(n_i, \mathbb{R})$$

in each layer, and replace the equivariant condition 5 with the more general 1, i.e.,

$$f(\rho_0(g) x) = \rho_L(g) f(x) \quad \text{for all } x \in \mathbb{R}^{n_0}, \; g \in G,$$

or, equivalently,

$$W_i \rho_{i-1}(g) = \rho_i(g) W_i \quad \text{and} \quad \sigma_{b_i}(\rho_i(g) x) = \rho_i(g) \sigma_{b_i}(x) \tag{7}$$

for all $g \in G$. In case 7 evokes memories of Schur's Lemma, we would like to stress that the representations $\rho_0, \dots, \rho_L$ are in general very far from being irreducible and that the map $f$ is nonlinear. Indeed the scenario described by Schur's Lemma is undesirable for equivariant neural networks: As we pointed out earlier, we do not want to restrict our weight matrices to the form $\lambda I$ or a direct sum of these.

We summarize our discussion with a formal definition.

**Definition.** Let $G$ be a group and $\rho_i \colon G \to \mathrm{GL}(n_i, \mathbb{R})$, $i = 0, 1, \dots, L$, be representations of $G$. A $G$-equivariant neural network is a feed-forward neural network $f = W_L \sigma_{b_{L-1}} W_{L-1} \cdots \sigma_{b_1} W_1$ whose weights $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$ and activations satisfy 7 for all $g \in G$.

A word of caution is in order here. What we call a neural network [MP43], i.e., the alternate composition of activations with affine maps, is sometimes also called a multilayer perceptron [Ros58]; a standard depiction is shown in Figure 1. When it is fit with data, one would invariably feed its output into a loss function, and that is usually not equivariant; or one might chain together multiple units of multilayer perceptrons into larger frameworks like autoencoders, generative adversarial networks, transformers, etc., that contain other nonequivariant components. In the literature, the term "neural network" sometimes refers to the entire framework collectively. In our article, it just refers to the multilayer perceptron, which is the part that is equivariant.

We will use an insightful toy example as illustration. Let $X = (\mathbb{R}^3)^n$ be the set of possible positions of $n$ unit-weight masses $x_1, \dots, x_n \in \mathbb{R}^3$, and compute the center of mass

$$f(x_1, \dots, x_n) = \frac{1}{n} \sum_{i=1}^n x_i, \tag{8}$$

with $f \colon (\mathbb{R}^3)^n \to \mathbb{R}^3$. We use the same system of coordinates for each copy of $\mathbb{R}^3$ in $X$ and $Y = \mathbb{R}^3$. If we work in a different coordinate system, the position of the center of mass remains unchanged but its coordinates will change accordingly. For simplicity, we consider a linear change of coordinates, represented by the action of a matrix $A \in \mathrm{GL}(3, \mathbb{R})$ on each point in $\mathbb{R}^3$. By linearity,

$$f(A x_1, \dots, A x_n) = A f(x_1, \dots, x_n),$$

so $f$ is $\mathrm{GL}(3, \mathbb{R})$-equivariant. Since each mass has the same unit weight, $f$ is also invariant under permutations of the input points. Let $\pi \in S_n$, which acts on $X$ via $\pi \cdot (x_1, \dots, x_n) = (x_{\pi^{-1}(1)}, \dots, x_{\pi^{-1}(n)})$ and acts trivially on $Y$ via $\pi \cdot y = y$. As the sum in 8 is permutation invariant,

$$f(x_{\pi^{-1}(1)}, \dots, x_{\pi^{-1}(n)}) = f(x_1, \dots, x_n),$$

so $f$ is $S_n$-invariant. Combining our two group actions, we see that $f$ is $\mathrm{GL}(3, \mathbb{R}) \times S_n$-equivariant. Note that the group here is $\mathrm{GL}(3, \mathbb{R}) \times S_n$, which has much lower dimension than $\mathrm{GL}(3n, \mathbb{R})$ for large $n$. This is typical in equivariant neural networks.
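Both properties of the center of mass are easy to check numerically; a sketch with illustrative random data:

```python
import numpy as np

def center_of_mass(X):
    """f(x_1, ..., x_n) = (x_1 + ... + x_n) / n for columns x_i in R^3."""
    return X.mean(axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))     # n = 5 unit masses in R^3
A = rng.standard_normal((3, 3))     # a linear change of coordinates
perm = rng.permutation(5)

# GL(3, R)-equivariance: f(A x_1, ..., A x_n) = A f(x_1, ..., x_n)
assert np.allclose(center_of_mass(A @ X), A @ center_of_mass(X))

# S_n-invariance: permuting the points leaves the center of mass fixed
assert np.allclose(center_of_mass(X[:, perm]), center_of_mass(X))
```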

In this simple example, we not only know that $f$ is equivariant but have an explicit expression for it. In general, there are many functions $f$ that we know should be equivariant or invariant to certain group actions, but for which we do not know any simple closed-form expression; and this is where it helps to assume that $f$ is given by some neural network whose parameters could be determined by fitting it with data, or, if it is used as an ansatz, by plugging into some differential or integral equations. A simple data-fitting example is provided by semantic segmentation in images, which seeks to classify pixels as belonging to one of several types of objects. If we rotate or mirror an image, we expect the pixel labels to follow the pixels. A more realistic version of the center of mass example would be a molecule represented by the positions of its atoms, which comes up in chemical property or drug response predictions. Here we want equivariance with respect to coordinate transformations, but we wish to preserve pairwise distances between atoms and chirality, so the natural group to use is $\mathrm{SO}(3)$ [KLT18] or the special Euclidean group $\mathrm{SE}(3)$ [WGW18, FWFW20]. The much-publicized protein structure prediction engine of DeepMind's AlphaFold 2 relies on an $\mathrm{SE}(3)$-equivariant neural network and an $\mathrm{SE}(3)$-invariant attention module [JEP21]. In [TEW21], equivariant convolution is used to improve accuracy assessments of RNA structure models.

Another straightforward example comes from computational quantum chemistry, where one seeks a solution to a Schrödinger equation: if we write $\psi(x_1, \dots, x_n)$ for the positions $x_1, \dots, x_n \in \mathbb{R}^3$ of $n$ particles, then the wave function of identical spin-$\frac{1}{2}$ fermions is antisymmetric, i.e.,

$$\psi(x_{\pi(1)}, \dots, x_{\pi(n)}) = \operatorname{sgn}(\pi) \, \psi(x_1, \dots, x_n)$$

for all $\pi \in S_n$. In other words, the increasingly popular antisymmetric neural networks [HSN20] are $S_n$-equivariant neural networks. Even without going into the details, the reader could well imagine that restricting to neural networks that are antisymmetric is a savings from having to consider all possible neural networks. More esoteric examples in particle physics call for Lorentz groups of various stripes, which are used in Lorentz-equivariant neural networks to identify top quarks in data from high-energy physics experiments [BAO20].
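One standard way to build an antisymmetric function, used in far more elaborate form by the networks of [HSN20], is a Slater-style determinant: arrange features of the points as rows of a matrix, so that swapping two points swaps two rows and flips the sign of the determinant. A toy sketch with made-up "orbital" features:

```python
import numpy as np

def antisymmetric_net(X, phi):
    """psi(x_1, ..., x_n) = det [phi_j(x_i)]_{i,j}, where phi maps a point
    in R^3 to n feature values ("orbitals"). Swapping two points swaps
    two rows of the matrix, so the determinant changes sign."""
    M = np.array([phi(x) for x in X])   # n x n, rows indexed by points
    return np.linalg.det(M)

# Illustrative "orbitals": monomial features of the first coordinate.
phi = lambda x: np.array([1.0, x[0], x[0] ** 2, x[0] ** 3])

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))        # n = 4 points in R^3
Xswap = X[[1, 0, 2, 3]]                # transpose the first two points

assert np.isclose(antisymmetric_net(Xswap, phi), -antisymmetric_net(X, phi))
```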

We now discuss the equivariant condition for pointwise nonlinearities $\sigma$. It is instructive to look at a simple numerical example. Suppose we apply a pointwise nonlinearity $\sigma$ and a permutation matrix $P$ to a vector $x$; we see that $\sigma(Px) = P\sigma(x)$, which clearly holds more generally, i.e., for any permutation matrix $P$ and any pointwise nonlinearity $\sigma$. The bottom line is that the permutation matrix