Is deep learning a useful tool for the pure mathematician?

By Geordie Williamson

Abstract

A personal and informal account of what a pure mathematician might expect when using tools from deep learning in their research.

1. Introduction

Over the last decade, deep learning has found countless applications throughout industry and science. However, its impact on pure mathematics has been modest. This may be surprising, as some of the tasks at which deep learning excels—like playing the board game Go or finding patterns in complicated structures—appear to present similar difficulties to problems encountered in research mathematics. On the other hand, the ability to reason—probably the single most important defining characteristic of mathematical enquiry—remains a central unsolved problem in artificial intelligence. Thus, mathematics can be seen as an important litmus test as to what modern artificial intelligence can and cannot do.

There is great potential for interaction between mathematics and machine learning.⁠Footnote1 However, there is also a lot of hype, and it is easy for the mathematician to be put off. In my experience, it remains hard to use deep learning to aid my mathematical research. But it is possible. One also has the sense that the potential, once the right tools have been uncovered, is significant.

1

In 1948, Turing [Tur48, §6] identified games, mathematics, cryptography, and the learning and translation of languages as five “suitable branches of thought” in which experimentation with machine intelligence might be fruitful.

This is a very informal survey of what a working mathematician might expect when using the tools of deep learning on mathematics problems. I outline some of the beautiful ideas behind deep learning. I also give some practical hints for using these tools. I finish with some examples where deep learning has been used productively in pure mathematics research. (I hope it goes without saying that the impact of deep learning on applied mathematics has been enormous.)

Finally, in my experience, the more one uses the tools of deep learning, the more difficult it becomes not to ask oneself foundational questions about why they work. This raises an entirely different set of questions. Although fascinating, the mathematical theory of deep learning is not the focus here.

Remark 1.1.

The elephant in the room of any discussion of deep learning today is the recent success of ChatGPT and other large language models. The internet is full of examples of ChatGPT doing both very well and very poorly on reasoning and mathematics problems. It seems likely that large language models will be able to interact well with proof assistants in the near future (see, e.g., [HRW21, JWZ23]). It is also likely that a greater role will be played in mathematics research by very large models, possibly with emergent capabilities (“foundation models” in the language of the excellent Bommasani et al. [BHA21]). The impacts of such developments on mathematics are difficult to predict. In this article I will ignore these questions entirely. Thus I will restrict myself to situations in which deep learning can be used by mathematicians without access to these large models.

2. What is a neural network?

Artificial neural networks emulate the biological neural networks present in the brains of humans and other animals. Typically, this emulation takes place on a computer. The idea of doing so is very natural. See [MP43] and [Tur48] for remarkable early accounts.

A cartoon picture of a neuron imagines it as a unit with several inputs and a single output, which may then be connected to other neurons:

Neurons “fire” by emitting electrical charge along their axon. We may encode the charge arriving along each input by a real number $x_i$, in which case the charge emitted by a neuron is given by

$$y = \sigma(x_1 + x_2 + \dots + x_n),$$

where $\sigma$ is a (typically monotone increasing and nonlinear) activation function. Soon we will assume that our activation function is fixed;Footnote2 however, at this level of precision the reader is encouraged to imagine something like a smooth, increasing, S-shaped curve. The activation function is meant to model the nonlinear response curves of neurons to stimuli. For example, some neurons may not fire until a certain charge is reached at their source.Footnote3

2

and equal to “ReLU”: $\sigma(x) = \max(x, 0)$.

3

In biological neural nets there is typically large variation in the responses of neurons to stimuli depending on where they are in the brain (see, e.g., [HW62]). This is one of the many features of biological neural nets that is usually ignored when building artificial neural networks.

Another important feature of neurons is that their firing may be excitatory or inhibitory of downstream neurons to varying degrees. In order to account for this, one allows modification of the input charges via weights (the $w_i$):

(1)   $y = \sigma(w_1 x_1 + w_2 x_2 + \dots + w_n x_n).$

Thus positive and negative weights correspond to excitatory and inhibitory connections, respectively.

Having settled on a crude mathematical model of a single neuron, we may then assemble them together to form a neural network:

Implicit in this picture is the assignment of a weight to each edge. Thus our neural network yields a function which takes real-valued inputs (5 in the above picture) and returns real-valued outputs (2 above), via repeated application of equation (1) at each node.

This is a good picture for the layperson to have in mind. It is useful to visualize the complex interconnectedness present in artificial neural networks, as well as the locality of the computation taking place. However, for the mathematician, one can explain things a little differently. The configuration

is simply a complicated way of drawing a matrix. In other words, we can rewrite our neural network above economically in the form

$$\mathbb{R}^5 \xrightarrow{A_1} \mathbb{R}^{d_1} \xrightarrow{\sigma} \mathbb{R}^{d_1} \xrightarrow{A_2} \mathbb{R}^{d_2} \xrightarrow{\sigma} \cdots \xrightarrow{A_k} \mathbb{R}^2,$$

where the $A_i$ are linear maps determined by matrices of weights, and $\sigma$ is shorthand for the coordinatewise application of our activation function.

For the purposes of this article, a vanilla neural networkFootnote4 is a gadget of the form

(2)   $\mathbb{R}^{d_0} \xrightarrow{A_1} \mathbb{R}^{d_1} \xrightarrow{\sigma} \mathbb{R}^{d_1} \xrightarrow{A_2} \mathbb{R}^{d_2} \xrightarrow{\sigma} \cdots \xrightarrow{A_k} \mathbb{R}^{d_k}$

4

One often encounters the term “Multilayer Perceptron” (MLP) in the literature.

where the $A_i$ are affine linear maps. We refer to $\mathbb{R}^{d_0}, \mathbb{R}^{d_1}, \dots, \mathbb{R}^{d_k}$ as the layers of the network. In order to simplify the discussion, we always assume that our activation function is given by ReLU (the “rectified linear unit”), that is,

$$\sigma\Big(\sum_i x_i e_i\Big) = \sum_i \max(x_i, 0)\, e_i,$$

where the $e_i$ are standard basis vectors.
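To make the definition concrete, the following is a minimal NumPy sketch of the forward pass of such a gadget. The layer dimensions, the initialization scheme, and all names are illustrative choices and not part of the definition.

    import numpy as np

    def relu(x):
        # coordinatewise ReLU: max(x, 0)
        return np.maximum(x, 0.0)

    def affine_layer(d_in, d_out, rng):
        # an affine linear map A(x) = Wx + b with randomly initialized weights
        W = rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(d_out, d_in))
        b = np.zeros(d_out)
        return W, b

    def forward(layers, x):
        # the composition in (2): alternate affine maps with coordinatewise ReLU,
        # with no activation after the final affine map
        *hidden, (W_last, b_last) = layers
        for W, b in hidden:
            x = relu(W @ x + b)
        return W_last @ x + b_last

    rng = np.random.default_rng(0)
    dims = [5, 4, 4, 2]       # d_0 = 5 inputs, two hidden layers, d_k = 2 outputs
    layers = [affine_layer(a, b, rng) for a, b in zip(dims[:-1], dims[1:])]
    print(forward(layers, rng.normal(size=5)))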

Remark 2.1.

We make the following remarks:

(1)

The attentive reader might have observed a sleight of hand above, where we suddenly allowed affine linear maps in our definition of a vanilla neural net. This can be justified as follows: In biological neural nets both the charge triggering a neuron to fire, as well as the charge emitted, varies across the neural network. This suggests that each activation function should have parameters, i.e., be given by $x \mapsto a\,\sigma(x + b)$ for constants $a, b$ varying at each node. Things just got a lot more complicated! Affine linear maps circumvent this issue: by adding the possibility of affine linear maps one gets the same degree of expressivity with a much simpler setup.

(2)

We only consider ReLU activation functions below. This is one of the standard choices, and provides a useful simplification. However, one should not forget that it is possible to vary activation functions.

(3)

We have tried to motivate the above discussion of neural networks as some imitation of neural activity. It is important to keep in mind that this is a very loose metaphor at best. However, I do find it useful in understanding and motivating basic concepts. For an excellent account along these lines by David Mumford, the reader is referred to Reference Mum20.

(4)

The alert reader will notice that we have implicitly assumed above that our graphs representing neural networks do not have any cycles or loops. Again, this is a simplification, and it is desirable in certain situations (e.g., in recurrent neural networks) to allow loops.

Vanilla neural networks are often referred to as fully-connected because each neuron is connected to every neuron in the next layer. This is almost the opposite of the situation encountered in the brain, where remarkably sparse neural networks are found. The connection pattern of the neurons is referred to as the architecture of the neural network. As well as vanilla neural networks, important artificial neural network architectures include convolutional neural networks, graph neural networks, and transformers. Constraints of length prohibit us from discussing these architectures in any depth.

Remark 2.2.

More generally, the term “neural network” is often used to refer to any program in which the output depends in a smooth way on the input (and thus the program can be updated via some form of gradient descent). We ignore this extra generality here.

3. Motivation for deep learning

In order to understand deep learning, it is useful to keep in mind the tasks at which it first excelled. One of the most important such examples is image classification. For example, we might want to classify hand-written digits:

Here each digit is given as (say) a $28 \times 28$ matrix of grayscale values between 0 and 255. This is a task which is effortless for us, but is traditionally difficult for computers.

We can imagine that our brain contains a function which sees a hand-written digit and produces a probability distribution on $\{0, 1, \dots, 9\}$, i.e., “what digit we think it is”.Footnote5 We might attempt to imitate this function with a neural network.

5

I can convince myself that my brain produces a probability distribution and not a yes/no answer by recalling my efforts to decipher my grandmother’s letters when I was a child.

Let us consider a simpler problem in which we try to decide whether a hand-written digit is a 6 or not:

We assume that we have “training data” consisting of images labelled by “6” or “not 6”. As a first attempt we might consider a network having a single linear layer:

$$\mathbb{R}^{28 \times 28} \xrightarrow{A} \mathbb{R} \xrightarrow{x \mapsto 1/(1+e^{-x})} (0, 1).$$

Here $A$ is affine linear, and the second function (the “logistic function”)Footnote6 is a convenient way of converting an arbitrary real number into a probability. Thus, positive values of $A$ mean that we think our image is a 6, and negative values of $A$ mean we think it is not.

6

a.k.a. sigmoid in the machine learning literature

We will be successful if we can find a hyperplane separating all vectors corresponding to sixes (red dots) from those that do not represent sixes (blue dots):

Of course, it may not be possible to find such a hyperplane. Also, even if we find a hyperplane separating red and blue dots, it is not clear that such a rule would generalize to correctly predict whether an unseen image (i.e., an image not in our training data) represents a 6 or not. Remarkably, techniques of this form (for example logistic regression, Support Vector Machines (SVMs), …) do work in many simple learning scenarios. Given training data (e.g., a large set of vectors labelled with “yes” and “no”) the optimal separating hyperplane may be found easily.Footnote7

7

For a striking mathematical example of support vector machines see [HK22], where SVMs are trained to distinguish simple from nonsimple finite groups by inspection of their multiplication tables.
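Returning to the single-layer classifier above, here is a minimal sketch of this approach in plain NumPy, trained by gradient descent on the logistic (cross-entropy) loss. The data is a synthetic stand-in for labelled images: the labels come from a hidden hyperplane, so a separating hyperplane exists by construction, and none of the sizes or constants are canonical.

    import numpy as np

    rng = np.random.default_rng(0)

    # stand-in for 1000 flattened 28x28 grayscale images rescaled to [0, 1];
    # labels: 1 = "a six", 0 = "not a six", generated by a hidden hyperplane
    X = rng.random((1000, 784))
    hidden = rng.normal(size=784)
    scores = X @ hidden
    y = (scores > np.median(scores)).astype(float)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w, b = np.zeros(784), 0.0
    lr = 0.1
    for step in range(2000):
        p = sigmoid(X @ w + b)                  # predicted probability of "six"
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient of the cross-entropy loss in w
        b -= lr * np.mean(p - y)                # ... and in the bias b

    # fraction of training points on the correct side of the learned hyperplane
    print(np.mean((X @ w + b > 0) == (y > 0.5)))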

4. What is deep learning?

In many classification problems the classes are not linearly separable.

Linear methods such as SVMs can nevertheless still be used in many cases, after application of a suitable feature map, namely a (non-linear) transformation whose application to the data makes linear separation of the classes possible:

It is on such more difficult learning tasks that deep learning can come into its own. The idea is that successive layers of the neural net transform the data gradually, eventually leading to an easier learning problem.⁠Footnote8

8

This idea seems to have been present in the machine learning literature for decades; see, e.g., [LBBH98]. It is well explained in [GBC16, §6]. For illustrations of this, as well as the connection to fundamental questions in topology, see the work of Olah [Ola14].

In the standard setting of supervised learning, we assume the existence of a function

$$f : \mathbb{R}^m \to \mathbb{R}^n$$

and know a (usually large) number of its values. The task is to find a reasonable approximation of $f$ given these known values. (The reader should keep in mind the motivating problem of §3, where one wants to learn a function

$$f : \mathbb{R}^{28 \times 28} \to \{\text{probability distributions on } \{0, 1, \dots, 9\}\}$$

giving the probabilities that a certain $28 \times 28$-pixel grayscale image represents one of the 10 digits $0, 1, \dots, 9$.)

We fix a network architecture, which, in our simple setting of a vanilla neural net, means that we fix the number of layers $k$ and the layer dimensions $d_1, \dots, d_{k-1}$ (the dimensions $d_0 = m$ and $d_k = n$ are determined by $f$). We then build a neural net (see §2) which serves as our function approximator:

$$\hat{f} = A_k \circ \sigma \circ A_{k-1} \circ \cdots \circ \sigma \circ A_1 : \mathbb{R}^m \to \mathbb{R}^n.$$

To begin with, the affine linear maps $A_i$ are initialized via some (usually random) initialization scheme, and hence the function $\hat{f}$ output by our neural network will be random and have no relation to our target function $f$. We then measure the distance between our function $\hat{f}$ and $f$ via some loss function $L$. (For example, $L$ might be the mean squared distance between the values of $\hat{f}$ and $f$.)Footnote9 A crucial assumption is that this loss function is differentiable in terms of the weights of our neural network. Finally, we perform gradient descent with respect to the loss function in order to update the parameters in equation (2) to (hopefully) better approximate $f$.

9

There are many subtleties here, and a good choice of loss function is one of them. In my limited experience, neural networks do a lot better learning probability distributions than general functions. When learning probability distributions, cross entropy [GBC16, §3.13] is the loss function of choice.
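In code, the training loop just described might look roughly as follows. This is a minimal PyTorch sketch: the target function, the architecture, and all hyperparameters are placeholder choices.

    import torch
    from torch import nn

    def f(x):
        # a placeholder target R^3 -> R; we pretend to know it only at sampled points
        return (x ** 2).sum(dim=1, keepdim=True)

    X = torch.rand(10_000, 3)                 # points at which the values of f are "known"
    Y = f(X)

    model = nn.Sequential(                    # a vanilla network as in (2)
        nn.Linear(3, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 1),
    )
    loss_fn = nn.MSELoss()                    # mean squared distance between the model and f
    optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

    for step in range(2_000):
        idx = torch.randint(0, len(X), (128,))    # a random minibatch of training data
        loss = loss_fn(model(X[idx]), Y[idx])
        optimizer.zero_grad()
        loss.backward()                       # gradients of the loss w.r.t. all weights
        optimizer.step()                      # one step of (stochastic) gradient descent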

In order to get an intuitive picture of what is happening during training, let us assume that $n = 1$ (so we are trying to learn a scalar function $f : \mathbb{R}^m \to \mathbb{R}$), and that our activation functions are ReLU. Thus $\hat{f}$ is the composition of affine linear and piecewise linear functions, and hence is piecewise linear. As with any piecewise linear function, we obtain a decomposition of $\mathbb{R}^m$ into polytopal regions

$$\mathbb{R}^m = R_1 \cup R_2 \cup \dots \cup R_M$$

such that $\hat{f}$ is affine linear on each region $R_j$. As training progresses, the affine linear functions move in a similar way to the learning of a line of best fit, but the process is more complex, since the regions we are dealing with may also move, disappear, or spawn new regions.
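One can observe these regions directly: if every hidden ReLU is “on” at one input exactly when it is “on” at another, then the network computes the same affine linear formula at both inputs, so they lie in a common region. The following sketch, with an arbitrary randomly initialized network, records this “activation pattern”.

    import numpy as np

    def activation_pattern(layers, x):
        # record, for each hidden ReLU, whether its input is positive ("on");
        # two points with the same pattern lie in a common region of linearity
        pattern = []
        *hidden, _last = layers
        for W, b in hidden:
            z = W @ x + b
            pattern.append(tuple(z > 0))
            x = np.maximum(z, 0.0)
        return tuple(pattern)

    rng = np.random.default_rng(1)
    layers = [(rng.normal(size=(8, 2)), rng.normal(size=8)),   # one hidden layer of 8 neurons
              (rng.normal(size=(1, 8)), rng.normal(size=1))]   # final affine map to R
    x1, x2 = rng.normal(size=2), rng.normal(size=2)
    # True exactly when x1 and x2 produce the same on/off pattern
    print(activation_pattern(layers, x1) == activation_pattern(layers, x2))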

Remark 4.1.

For an excellent interactive animation of a simple neural network learning a classification task, the reader is urged to experiment with the TensorFlow Playground [SC]. Karpathy’s convolutional neural network demo [Kar] is also illustrative.

Remark 4.2.

Some remarks:

(1)

Typically, one splits the known values of $f$ into two disjoint sets consisting of training data and validation data. Steps of gradient descent are only performed using the training data, and the validation data allows us to periodically check whether our model is also making reasonable predictions at points not present in the training data (“validation error”). It is sometimes useful to have an additional set of test data, completely unseen during training, which one can use to compute the performance of the trained model (“test error”).

(2)

In most applications of machine learning, the training data is enormous, and feeding it all through the neural network (2) in order to compute the loss function is unduly expensive. Thus one usually employs stochastic gradient descent: at every step the gradient of the loss function is computed using a small random subset (a “minibatch”) of the training data.

(3)

Using a model with a small number of parameters (as traditionally done in statistics and some ML methods) has advantages for interpretability and computation. It can also help avoid overfitting, where the chosen predictor may fit the data set so closely that it ends up fitting noise and fails to adequately capture the underlying data-generating process. Deep learning is different in that there are often enough parameters to allow overfitting. What is surprising is that neural nets often generalize well (i.e., don’t overfit) even though they could in principle (this is an enormous subject; see, e.g., [BHMM19]).

5. Simple examples from pure mathematics

It is important to keep in mind that the main motivating applications for deep learning research are very different from those arising in pure mathematics. For example, the “recognize a hand-written digit” function considered in the previous two sections is rather different to the Riemann zeta function!⁠Footnote10

10

One should keep in mind that neural networks are universal approximators: a large enough neural network can approximate any continuous function (on a compact domain) to any desired accuracy [Wik]. However, in practice some functions are much more easily learnt than others.

This means that the mathematician wanting to use machine learning should keep in mind that they are using tools designed for a very different purpose. The hype that “neural nets can learn anything” also doesn’t help. The following rules of thumb are useful to keep in mind when selecting a problem for deep learning:

(1)

Noise stable. Functions involved in image and speech recognition motivated much research in machine learning. These functions typically have very high-dimensional input (e.g., $\mathbb{R}^{d \times d}$ for a $d \times d$ square grayscale image) and are noise stable. For example, we can usually recognise an image or understand speech after the introduction of a lot of noise. Neural nets typically do poorly on functions which are very noise-sensitive.Footnote11

11

This point should be read with some caution. For example, evaluation of board positions in Go is not a particularly noise-stable problem.

(2)

High dimensional. If one thinks of a neural network as a function approximator, it is a function approximator that comes into its own on high-dimensional input. These are the settings in which traditional techniques like Fourier series break down due to the curse of dimensionality. Deep learning should be considered when the difficulty comes from the dimensionality, rather than from the inherent complexity of the function.

(3)

Unit cube. Returning to our (unreliable) analogy with biological neural nets, one expects all charges occurring in the brain to belong to some fixed small interval. The same is true of artificial neural networks: they perform best when all real numbers encountered throughout the network from input to output belong to some bounded interval. Deep learning packages are often written assuming that the inputs belong to the unit cube $[0,1]^m$ (a short sketch of the usual rescaling appears after this list).

(4)

Details matter. Design choices like network architecture and size, initialization scheme, choice of learning rate (i.e., step size of gradient descent), choice of optimizer, etc., matter enormously. It is also important how the inputs to the neural network are encoded as vectors in $\mathbb{R}^m$ (the representation).Footnote12 Overcoming these difficulties is best done with a collaborator who has experience in deep learning research and implementation.

12

It seems silly to have to write that details matter in any technical subject. However, many people I have spoken to are under the false impression that one model works for everything, and that training happens “out of the box” and is easy. For an excellent and honest summary by an expert of the difficulties encountered when training large models, see [Kar19].
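As promised in rule (3), the rescaling in question is usually no more than a coordinatewise affine change of variables. Here is a two-line sketch with hypothetical raw features; standardizing to mean zero and variance one is an equally common alternative.

    import numpy as np

    # hypothetical raw features on very different scales
    X = np.array([[3.0, 250.0],
                  [1.0,  10.0],
                  [2.0,  40.0]])
    lo, hi = X.min(axis=0), X.max(axis=0)
    X_unit = (X - lo) / (hi - lo)      # every entry now lies in the unit cube
    print(X_unit)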

With these rules of thumb in mind we will now discuss three examples in pure mathematics.

5.1. Learning the parity bit

Consider the parity bit function

$$P_n : \{0, 1\}^n \to \{0, 1\}, \qquad P_n(x_1, \dots, x_n) = x_1 + x_2 + \dots + x_n \pmod 2.$$

We might be tempted to use a neural network to try to learn a function

$$f : [0, 1]^n \to \mathbb{R}$$

which agrees with $P_n$ under the natural embedding $\{0, 1\}^n \subset [0, 1]^n$.

This is a classic problem in machine learning [MP17, I §3.1]. It generalizes the problem of learning the XOR function (the case $n = 2$), which is one of the simplest problems that cannot be learned without nonlinearities. There exist elegant neural networks extending $P_n$ to the unit cube, and given a large proportion (e.g., 50%) of the set $\{0, 1\}^n$ a neural network can be trained to express $P_n$ [RHW85, pp. 14–16]. However, given only a small proportion of the values of $P_n$, a vanilla neural network will not reliably generalize to all values of $P_n$ (for experiments, see [GGW22, ‘Playing with parity’]).

The issue here is that $P_n$ is highly noise sensitive. (Indeed, $P_n$ is precisely the checksum of signal processing!) This is an important example to keep in mind, as many simple functions in pure mathematics resemble $P_n$. For an example, see [GGW22, Week 2] where we attempt (without much luck!) to train a neural network to learn the Möbius function from number theory.
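Below is a sketch of the kind of experiment referred to above: train a small vanilla network on a random subset of values of $P_n$ and test it on bit strings it has not seen. The value of $n$, the network size, the optimizer, and the sample sizes are arbitrary choices; with such a small training fraction one typically observes near-perfect training accuracy together with test accuracy close to chance.

    import torch
    from torch import nn

    n = 16
    X = torch.randint(0, 2, (4000, n)).float()
    y = (X.sum(dim=1) % 2).long()                  # the parity bit P_n
    X_train, y_train = X[:2000], y[:2000]          # a tiny fraction of the 2^16 possible inputs
    X_test, y_test = X[2000:], y[2000:]

    model = nn.Sequential(nn.Linear(n, 128), nn.ReLU(),
                          nn.Linear(128, 128), nn.ReLU(),
                          nn.Linear(128, 2))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(3000):
        idx = torch.randint(0, len(X_train), (256,))
        loss = loss_fn(model(X_train[idx]), y_train[idx])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    with torch.no_grad():
        train_acc = (model(X_train).argmax(dim=1) == y_train).float().mean().item()
        test_acc = (model(X_test).argmax(dim=1) == y_test).float().mean().item()
    print(f"train accuracy {train_acc:.2f}, test accuracy {test_acc:.2f}")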

5.2. Learning descent sets

Consider the symmetric group $S_n$ consisting of all permutations of $\{1, 2, \dots, n\}$. Given a permutation $w \in S_n$ we can consider its left and right descent sets:

(4)   $D_L(w) = \{\, i \mid w^{-1}(i) > w^{-1}(i+1) \,\},$
(5)   $D_R(w) = \{\, i \mid w(i) > w(i+1) \,\}.$

Obviously, $D_L(w) = D_R(w^{-1})$ and $D_R(w) = D_L(w^{-1})$. The left and right descent sets are important invariants of a permutation.

It is interesting to see whether a neural network can be trained to learn the left and right descent sets. In other words, we would like to train a neural network

$$\mathbb{R}^n \to \mathbb{R}^{n-1}$$

which, given the vector $(w(1), w(2), \dots, w(n))$, returns a sequence of probabilities giving, for each $i$, whether or not $i$ belongs to the left (resp., right) descent set.

This example is interesting in that equation (5) implies that the right descent set can be predicted perfectly with a single linear layer. More precisely, if we consider the linear map

$$A : \mathbb{R}^n \to \mathbb{R}^{n-1}, \qquad (x_1, \dots, x_n) \mapsto (x_1 - x_2, x_2 - x_3, \dots, x_{n-1} - x_n),$$

then the $i$th coordinate of $A$ evaluated on the vector $(w(1), \dots, w(n))$ is positive if and only if $i \in D_R(w)$. On the other hand, it seems much harder to handcraft a neural network which extracts the left descent set from $(w(1), \dots, w(n))$.

This might lead us to guess that a neural network will have a much easier time learning the right descent set than the left descent set. This turns out to be the case, and the difference is dramatic: a vanilla neural network with two hidden layers of dimensions 500 and 100 learns to predict right descent sets with high accuracy after a few seconds, whereas the same network struggles to get even a single answer correct for the left descent set, even after significant training!Footnote13 It is striking that using permutation matrices as inputs, rather than the vectors $(w(1), \dots, w(n))$, gives perfect symmetry in training between left and right.Footnote14 The issue here is the representation: how the model receives its input can have a dramatic effect on model performance.

13

Decreasing $n$ and allowing longer training suggests that the network can learn the left descent set, but it is much harder.

14

For a colab containing all of these experiments, see [GGW22, Classifying descent sets in $S_n$].
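The discussion above is easy to check in a few lines of code. The following sketch computes both descent sets directly from the one-line notation and evaluates the “difference” linear map used in the argument for right descents; the example permutation is arbitrary.

    import numpy as np

    def right_descents(w):
        # D_R(w) = { i : w(i) > w(i+1) }, read directly from the one-line notation
        return {i + 1 for i in range(len(w) - 1) if w[i] > w[i + 1]}

    def left_descents(w):
        # D_L(w) = D_R(w^{-1}) = { i : i+1 appears before i in the word of w }
        position = {value: index for index, value in enumerate(w)}
        return {i for i in range(1, len(w)) if position[i + 1] < position[i]}

    w = (3, 1, 4, 2)                            # the permutation w(1)=3, w(2)=1, w(3)=4, w(4)=2
    print(right_descents(w), left_descents(w))  # {1, 3} and {2}

    # the single linear layer from the text: A(x)_i = x_i - x_{i+1}; its i-th coordinate
    # on (w(1), ..., w(n)) is positive exactly when i is a right descent
    A = np.eye(4)[:-1] - np.eye(4)[1:]
    print(A @ np.array(w))                      # [ 2. -3.  2.]: positive in coordinates 1 and 3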

5.3. Transformers and linear algebra

Our final example is much more sophisticated, and illustrates how important the choice of training data can be. It also shows how surprising the results of training large neural networks can be.

A transformer is a neural network architecture which first emerged in machine translation [VSP17]. We will not go into any detail about the transformer architecture here, except to say that it is well suited to tasks where the input and output are sequences of tokens (“sequence to sequence” tasks):

More precisely, the input sequence (“xyz”) determines a probability distribution over all tokens. We then sample from this distribution to obtain the first token (“a”). Now the input together with the sequence sampled so far (“xyz” + “a”) provides a new distribution over tokens, from which we sample our second token (“b”), etc.

In a recent work [Cha21], Charton trains a transformer to perform various tasks in linear algebra: matrix transposition, matrix addition, matrix multiplication, determination of eigenvalues, determination of eigenvectors, etc. For example, the eigenvalue task is regarded as the “translation”

(the entries of a symmetric matrix, encoded as a sequence of tokens) ⟼ (its eigenvalues, encoded as a sequence of tokens).

Charton considers real symmetric matrices, all of whose entries are signed floating point numbers with three significant figures and exponent lying in a fixed bounded range.Footnote15 The transformer obtains impressive accuracy on most linear algebra tasks. What is remarkable is that for the transformer the entries of the matrix (e.g., 3.14, -27.8, 0.000132, …) are simply tokens—the transformer doesn’t “know” that 3.14 is close to 3.13, or that both are positive; it doesn’t even “know” that its tokens represent numbers!

15

Charton considers various encodings of these numbers via sequences of tokens of various lengths; see [Cha21].

Another remarkable aspect of this work concerns generalization. A model trained on Wigner matrices (e.g., matrices whose entries are sampled uniformly from a fixed interval) does not generalize well at all to matrices with positive eigenvalues. On the other hand, a model trained on matrices with eigenvalues sampled from a Laplace distribution (which has heavy tails) does generalize to matrices whose eigenvalues are all positive, even though it has not seen a single such matrix during training! The interested reader is referred to Charton’s paper [Cha21, Table 12] and his lecture on YouTube [Cha22].
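The following is not Charton’s encoding or training code. It is only a sketch, under the simplifying assumption that matrices and eigenvalues are kept as floating point arrays rather than token sequences, of how one might generate (matrix, eigenvalues) training pairs with the two eigenvalue distributions contrasted above; the interval and scale parameters are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)

    def wigner_like(n, a=10.0):
        # symmetric matrix with (up to symmetry) i.i.d. entries, uniform on [-a, a]
        A = rng.uniform(-a, a, size=(n, n))
        return (A + A.T) / 2

    def with_laplace_spectrum(n, scale=3.0):
        # symmetric matrix whose eigenvalues are drawn from a heavy-tailed Laplace distribution
        eigenvalues = rng.laplace(0.0, scale, size=n)
        Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # a random orthogonal matrix
        return Q @ np.diag(eigenvalues) @ Q.T

    M = with_laplace_spectrum(5)
    target = np.sort(np.linalg.eigvalsh(M))
    # one "translation" pair: the matrix entries (source) and its eigenvalues (target)
    print(np.round(M, 3))
    print(np.round(target, 3))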

6. Examples from research mathematics

We now turn to some examples where deep learning has been used in pure mathematics research.

6.1. Counterexamples in combinatorics

One can dream that deep learning might one day provide a mathematician’s “bicycle for the mind”: an easy to use and flexible framework for exploring possibilities and potential counterexamples. (I have certainly lost many days trying to prove a statement that turned out to be false, with the counterexample lying just beyond my mental horizon.)

We are certainly not there yet, but the closest we have come to witnessing such a framework is provided in the work of Adam Wagner [Wag21]. He focuses on conjectures of the form: over all combinatorial structures $X$ in some class $\mathcal{C}$, an associated numerical quantity $q(X)$ is bounded by a constant. He considers situations where there is some simple recipe for generating the objects of $\mathcal{C}$, and where the numerical quantity $q$ is efficiently computable.

For example, a conjecture in graph theory states that for any connected graph $G$ on $n$ vertices, with largest eigenvalue $\lambda_1$ and matching number $\mu$, we have

(6)   $\lambda_1 + \mu \geq \sqrt{n - 1} + 1.$

(It is not important for this discussion to know what the matching number or largest eigenvalue are!)

Wagner fixes an enumeration $e_1, e_2, \dots, e_N$ of the $N = \binom{n}{2}$ edges of the complete graph on $n$ vertices. Graphs are generated by playing a single-player game: the player is offered $e_1$, then $e_2$, etc., and decides at each point whether to accept or reject the edge, the goal being to minimize the left-hand side of equation (6). A move in the game is given by a $\{0,1\}$-vector indicating which edges have been taken so far, together with a one-hot vector indicating which edge is under consideration. For example, when $n = 4$ the pair $((1, 0, 1, 1, 0, 0), (0, 0, 0, 0, 1, 0))$ indicates that edge number 5 is under consideration, that edges 1, 3, and 4 have already been selected, and that edge 2 has been rejected. Moves are sampled according to a neural network

(7)   $\{0, 1\}^N \times \{0, 1\}^N \to \mathbb{R},$

which (after application of sigmoid) gives the probability that we should take the edge under consideration.

Wagner then employs the cross-entropy method to gradually train the neural network. A fixed (and large) number of graphs are sampled according to the neural network (7). Then a fixed percentage (say, 10%) of the games resulting in the smallest values of the LHS of equation (6) are used as training data to update the neural network (7). (That is, we tweak the weights of the neural network so that it makes decisions resulting in graphs that are as close as possible to providing a counterexample to equation (6).) We then repeat. This method eventually finds a counterexample to equation (6) on 19 vertices. The evolution of graphs sampled from the neural network is shown in Figure 6.1—note how the neural network learns quickly that tree-like graphs do best. Exactly the same method works to discover counterexamples to several other conjectures in combinatorics; see [Wag21].
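A much-simplified sketch of this loop is given below. It is not Wagner’s code: the network shape, the number of games per round, the number of rounds, and the decision to simply discard disconnected graphs are all illustrative choices, and a real run requires far more computation.

    import networkx as nx
    import numpy as np
    import torch
    from torch import nn

    n = 19                                     # number of vertices
    E = [(i, j) for i in range(n) for j in range(i + 1, n)]
    N = len(E)                                 # number of possible edges

    # policy: (edges taken so far, one-hot edge under consideration) -> logit of "take it"
    policy = nn.Sequential(nn.Linear(2 * N, 128), nn.ReLU(),
                           nn.Linear(128, 64), nn.ReLU(),
                           nn.Linear(64, 1))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    def play_one_game():
        taken = torch.zeros(N)
        states, actions = [], []
        for k in range(N):
            onehot = torch.zeros(N)
            onehot[k] = 1.0
            state = torch.cat([taken, onehot])
            with torch.no_grad():
                p = torch.sigmoid(policy(state))
            act = torch.bernoulli(p).item()
            states.append(state)
            actions.append(act)
            taken[k] = act
        return taken, states, actions

    def score(taken):
        # lambda_1 + mu from (6); smaller values are closer to a counterexample
        G = nx.Graph([e for e, t in zip(E, taken.tolist()) if t > 0.5])
        G.add_nodes_from(range(n))
        if not nx.is_connected(G):
            return float("inf")                # discard disconnected graphs
        lam1 = float(max(np.linalg.eigvalsh(nx.to_numpy_array(G))))
        mu = len(nx.max_weight_matching(G, maxcardinality=True))
        return lam1 + mu

    for iteration in range(50):                # in practice: many more rounds and games
        games = [play_one_game() for _ in range(200)]
        games.sort(key=lambda g: score(g[0]))
        elite = games[: len(games) // 10]      # the 10% of games with the smallest score
        X = torch.stack([s for _, states, _ in elite for s in states])
        y = torch.tensor([a for _, _, actions in elite for a in actions]).unsqueeze(1)
        loss = nn.functional.binary_cross_entropy_with_logits(policy(X), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()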

6.2. Conjecture generation

The combinatorial invariance conjecture is a conjecture in representation theory which was proposed by Lusztig and Dyer in the early 1980s [Bre04]. To any pair of permutations in the symmetric group one may associate two objects: the Bruhat graph (a directed graph), and the Kazhdan–Lusztig polynomial (a polynomial in a variable $q$); see Figure 6.2 for an example of both. The conjecture states that an isomorphism between Bruhat graphs implies equality between Kazhdan–Lusztig polynomials. A more optimistic version of this conjecture asks for a recipe which computes the Kazhdan–Lusztig polynomial from the Bruhat graph. One interesting aspect of this conjecture is that it is (to the best of my knowledge) a conjecture born of pure empiricism.

For the Bruhat graph, the definition is simple, but the resulting graph is complicated. For the Kazhdan–Lusztig polynomial, on the other hand, the definition is complicated, but the resulting polynomial is simple. Thus, there is at least a passing resemblance to traditional applications of machine learning, where a simple judgement (e.g., “it’s a cat”) is made from complicated input (e.g., an array of pixels).

It is natural to use neural networks as a testing ground for this conjecture: if a neural network can easily predict the Kazhdan–Lusztig polynomial from the Bruhat graph, perhaps we can too! We trained a neural network to predict Kazhdan–Lusztig polynomials from the Bruhat graph. We used a neural network architecture known as a graph neural network, and trained it to predict a probability distribution on the coefficients of $q$, $q^2$, $q^3$, and $q^4$.Footnote16 The neural network was trained on Bruhat graphs, and achieved very high accuracy after less than a day’s training. This provides reasonable evidence that there is some way of reliably guessing the Kazhdan–Lusztig polynomial from the Bruhat graph.

16

The coefficient of $q^0$ is known to always equal $1$. In our training sets no coefficients of $q^5$ or higher occur.

It is notoriously difficult to go from a trained neural network to some kind of human understanding. One technique to do so is known as saliency analysis. Recall that neural networks often learn a piecewise linear function, and hence one can take derivatives of the learned function to try to learn which inputs have the most influence on a given output.Footnote17 In our example, saliency analysis provided subgraphs of the original Bruhat graph which appeared to have remarkable “hypercube”-like structure (see Figure 6.3 and [DVB21, Figure 5a]). After considerable work this eventually led to a conjecture [BBD22], which would settle the combinatorial invariance conjecture for symmetric groups if proven, and has stimulated research on this problem from pure mathematicians [GW23, BG23b, BG23a, BM23].

17

This technique is often called “vanilla gradient” in the literature. Apparently it is very brittle in real-world applications.

In a parallel development, Davies, Juhász, Lackenby, and Tomasev were able to use saliency analysis to discover a new relationship between the signature and hyperbolic invariants of knots [DJLT22]. The machine learning background of both works is explained in [DVB21]. It would be very interesting to find further examples where saliency leads to new conjectures and theorems.
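In its simplest (“vanilla gradient”) form, saliency analysis is only a few lines of code. The sketch below uses a placeholder fully-connected model and a random input; in the work described above the models were graph neural networks and the inputs Bruhat graphs or knot invariants, which requires more bookkeeping but rests on the same idea.

    import torch
    from torch import nn

    # stand-ins for a trained model and an input of interest
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    x = torch.rand(10, requires_grad=True)

    output = model(x).sum()
    output.backward()                     # d(output)/d(x_i) for every input coordinate i
    saliency = x.grad.abs()               # large entries = inputs with the most influence
    print(saliency.argsort(descending=True)[:3])   # indices of the three most salient inputs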

6.3. Guiding calculation

Another area where deep learning has promise to impact mathematics is in the guiding of calculation. In many settings a computation can be done in many ways. Any choice will lead to a correct outcome, but the choices made may drastically affect the length of the computation. It is interesting to apply deep learning in these settings, as false steps (which deep learning models are bound to make) affect efficiency but not accuracy.

Over the last three years there have been several examples of such applications. In [PSHL20], the authors use a machine learning algorithm to guide selection strategies in Buchberger’s algorithm, which is a central algorithm in the theory of Gröbner bases in polynomial rings. In [Sim21], Simpson uses deep neural networks to simplify proofs in the classification of nilpotent semigroups. In [HKS22], the authors use a deep neural network to predict computation times of period matrices, and use it to more efficiently compute the periods of certain hypersurfaces in projective space.

6.4. Prediction

Due to limitations of space, we cannot begin to survey all the work done in this infant subject. In particular, there has been much work (see, e.g., [BHH21, BCDL20]) training neural networks to predict difficult quantities in mathematics (e.g., volumes of polytopes, line bundle cohomology, etc.).

7. Conclusion

The use of deep learning in pure mathematics is in its infancy. The tools of machine learning are flexible and powerful, but need expertise and experience to use. One should not expect things to work “out of the box”. Deep learning has found applications in several branches of pure mathematics including combinatorics, representation theory, topology and algebraic geometry. Applications so far support the thesis that deep learning most usefully aids the more intuitive (“system 1”) parts of the mathematical process: spotting patterns, deciding where counter-examples might lie, choosing which part of a calculation to do next. However, the possibilities do seem endless, and only time will tell.

Acknowledgments

My understanding of this landscape has benefitted enormously from discussions with Charles Blundell, Lars Buesing, Alex Davies, Joel Gibson, Georg Gottwald, Camilo Libedinsky, Sébastien Racaniere, Carlos Simpson, Grzegorz Swirszcz, Petar Veličković, Adam Wagner, Théophane Weber, and Greg Yang. Without their help this journey would have been slower and much more painful. This article is based on a lecture given at the Fields Institute Symposium on the future of mathematical research, which was held at the instigation of Akshay Venkatesh. I would like to thank the organizers for organising a wonderful and thought-provoking meeting, and in particular Maia Fraser for very useful feedback on an earlier version of this article.

Figures

Figure 6.1.

The evolution of graphs towards Wagner’s counterexample, from [Wag21], with permission.

Figure 6.2.

Bruhat interval and Kazhdan–Lusztig polynomial for a pair of permutations in a symmetric group, from [BBD22].

Figure 6.3.

Bruhat interval pre- and post-saliency analysis.



References

Reference [BBD22]
Charles Blundell, Lars Buesing, Alex Davies, Petar Veličković, and Geordie Williamson, Towards combinatorial invariance for Kazhdan-Lusztig polynomials, Represent. Theory 26 (2022), 1145–1191, DOI 10.1090/ert/624. MR4510816,
Reference [BCDL20]
Callum R. Brodie, Andrei Constantin, Rehan Deen, and Andre Lukas, Machine learning line bundle cohomology, Fortschr. Phys. 68 (2020), no. 1, 1900087, 14, DOI 10.1002/prop.201900087. MR4059246,
Reference [BG23a]
Grant Barkley and Christian Gaetz, Combinatorial invariance for elementary intervals, Preprint, arXiv:2303.15577, (2023).
Reference [BG23b]
Grant T. Barkley and Christian Gaetz, Combinatorial invariance for lower intervals using hypercube decompositions, Sém. Lothar. Combin. 89B (2023), Art. 86, 8. MR4659594,
Reference [BHA21]
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al., On the opportunities and risks of foundation models, Preprint, arXiv:2108.07258, (2021).
Reference [BHH21]
Jiakang Bao, Yang-Hui He, Edward Hirst, Johannes Hofscheier, Alexander Kasprzyk, and Suvajit Majumder, Hilbert series, machine learning, and applications to physics, Phys. Lett. B 827 (2022), Paper No. 136966, 8, DOI 10.1016/j.physletb.2022.136966. MR4385221,
Reference [BHMM19]
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal, Reconciling modern machine-learning practice and the classical bias-variance trade-off, Proc. Natl. Acad. Sci. USA 116 (2019), no. 32, 15849–15854, DOI 10.1073/pnas.1903070116. MR3997901,
Reference [BM23]
Francesco Brenti, Kazhdan-Lusztig polynomials: history problems, and combinatorial invariance, Sém. Lothar. Combin. 49 (2002/04), Art. B49b, 30. MR2006565,
Reference [Bre04]
Francesco Brenti, Kazhdan-Lusztig polynomials: history problems, and combinatorial invariance, Sém. Lothar. Combin. 49 (2002/04), Art. B49b, 30. MR2006565,
Reference [Cha21]
François Charton, Linear algebra with transformers, CoRR abs/2112.01898, (2021).
Reference [Cha22]
Francois Charton, Math with Transformers, https://www.youtube.com/watch?v=81o-Uiop5CA, October 2022, Accessed on 20 March, 2023.
Reference [DJLT22]
Alex Davies, András Juhász, Marc Lackenby, and Nenad Tomasev, The signature and cusp geometry of hyperbolic knots, Geometry and Topology, (2022).
Reference [DVB21]
Alex Davies, Petar Veličković, Lars Buesing, Sam Blackwell, Daniel Zheng, Nenad Tomašev, Richard Tanburn, Peter Battaglia, Charles Blundell, András Juhász, Marc Lackenby, Geordie Williamson, Demis Hassabis, and Pushmeet Kohli, Advancing mathematics by guiding human intuition with AI, Nature 600 (2021), no. 7887, 70–74.
Reference [GBC16]
Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep learning, Adaptive Computation and Machine Learning, MIT Press, Cambridge, MA, 2016. MR3617773,
Reference [GGW22]
Joel Gibson, Georg Gottwald, and Geordie Williamson, Machine Learning for the Working Mathematician, https://sites.google.com/view/mlwm-seminar-2022, June 2022, Accessed on 20 March, 2023.
Reference [GW23]
Maxim Gurevich and Chuijia Wang, Parabolic recursions for Kazhdan–Lusztig polynomials and the hypercube decomposition, Preprint, arXiv:2303.09251 (2023).
Reference [HK22]
Yang-Hui He and Minhyong Kim, Learning algebraic structures: Preliminary investigations, International Journal of Data Science in the Mathematical Sciences (2022), 1–20.
Reference [HKS22]
Kathryn Heal, Avinash Kulkarni, and Emre Can Sertöz, Deep learning Gauss-Manin connections, Adv. Appl. Clifford Algebr. 32 (2022), no. 2, Paper No. 24, 41, DOI 10.1007/s00006-022-01207-1. MR4386394,
Reference [HRW21]
Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward W Ayers, and Stanislas Polu, Proof artifact co-training for theorem proving with language models, Preprint arXiv:2102.06203 (2021).
Reference [HW62]
David H Hubel and Torsten N Wiesel, Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex, The Journal of physiology 160 (1962), no. 1, 106.
Reference [JWZ23]
Albert Q. Jiang, Sean Welleck, Jin Peng Zhou, Wenda Li, Jiacheng Liu, Mateja Jamnik, Timothée Lacroix, Yuhuai Wu, and Guillaume Lample, Draft, sketch, and prove: Guiding formal theorem provers with informal proofs, 2023.
Reference [Kar]
Andrej Karpathy, Convnet Javascript Demo, https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html, Accessed on 18 March, 2023.
Reference [Kar19]
Andrej Karpathy, A Recipe for Training Neural Networks, http://karpathy.github.io/2019/04/25/recipe/, Apr 25, 2019, Accessed on 20 March, 2023.
Reference [LBBH98]
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (1998), no. 11, 2278–2324.
Reference [MP43]
Warren S. McCulloch and Walter Pitts, A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophys. 5 (1943), 115–133, DOI 10.1007/bf02478259. MR10388,
Reference [MP17]
Marvin Minsky and Seymour A. Papert, Perceptrons: An introduction to computational geometry, reissue of the 1988 expanded edition with a new foreword by Léon Bottou, MIT Press, 2017.
Reference [Mum20]
David Mumford, The Astonishing Convergence of AI and the Human Brain, October 1, 2020, Accessed on 13 March, 2023.
Reference [Ola14]
Christopher Olah, Neural Networks, Manifolds, and Topology, https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/, April 6, 2014, Accessed on 18 March, 2023.
Reference [PSHL20]
Dylan James Peifer, Reinforcement Learning in Buchberger’s Algorithm, ProQuest LLC, Ann Arbor, MI, 2021. Thesis (Ph.D.)–Cornell University. MR4297517,
Reference [RHW85]
David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams, Learning internal representations by error propagation, Tech. report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
Reference [SC]
Daniel Smilkov and Shan Carter, The Tensorflow Playground, https://playground.tensorflow.org, Accessed on 18 March, 2023.
Reference [Sim21]
Carlos Simpson, Learning proofs for the classification of nilpotent semigroups, Preprint, arXiv:2106.03015, (2021).
Reference [Tur48]
A. M. Turing, Intelligent machinery, a heretical theory, Philos. Math. (3) 4 (1996), no. 3, 256–260, DOI 10.1093/philmat/4.3.256. MR1406760,
Reference [VSP17]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
Reference [Wag21]
Adam Zsolt Wagner, Constructions in combinatorics via neural networks, Preprint, arXiv:2104.14516, (2021).
Reference [Wik]
Wikipedia, Universal approximation theorem, https://en.wikipedia.org/wiki/Universal_approximation_theorem, Accessed on 17 May, 2023.

Article Information

MSC 2020
Primary: 68T07 (Artificial neural networks and deep learning), 05E10 (Combinatorial aspects of representation theory)
Author Information
Geordie Williamson
Sydney Mathematical Research Institute, University of Sydney, A14 - Quadrangle, NSW 2006, Sydney, Australia
g.williamson@sydney.edu.au
Additional Notes

I am a pure mathematician, working mostly in geometric representation theory and related fields. I began an ongoing collaboration with DeepMind in 2020 on possible interactions of machine learning and mathematics, and have been fascinated by the subject ever since.

Journal Information
Bulletin of the American Mathematical Society, Volume 61, Issue 2, ISSN 1088-9485, published by the American Mathematical Society, Providence, Rhode Island.
Copyright Information
Copyright 2024 American Mathematical Society
DOI
10.1090/bull/1829
