Skip to Main Content

An Analytical and Geometric Perspective on Adversarial Robustness

Nicolás García Trillos
Matt Jacobs

Communicated by Notices Associate Editor Daniela De Silva

Article cover

…y luego se fueron el uno para el otro, como si fueran dos mortales enemigos.

Don Quixote; Chapter 8, Part 1.

1. Introduction

In the last ten years, neural networks have made incredible strides in classifying large data sets, to the point that they can now outperform humans in raw accuracy. However, the robustness of these systems is a completely different story. Suppose you were asked to identify whether a photo contained an image of a cat or a dog. You probably would have no difficulty at all; at worst, maybe you would only be tripped up by a particularly small or unusual Shiba Inu. In contrast, it has been widely documented that an adversary can convince an otherwise well-performing neural network that a dog is actually a cat (or vice-versa) by making tiny human-imperceptible changes to an image at the pixel level. These small perturbations are known as adversarial attacks and they are a significant obstacle to the deployment of machine learning systems in security-critical applications GSS14. The susceptibility to adversarial attacks is not exclusive to neural network models, and many other learning systems have also been observed to be brittle when facing adversarial perturbations of data.

The business of defending against adversarial attacks is known as adversarial robustness, robust training, or simply adversarial training (although we will mostly reserve the latter name for a specific optimization objective). There are many methods in the literature that can be used to build defenses against adversarial attacks, but here we will be particularly interested in methods that enforce robustness during model training. In these types of methods, the standard training process — driven primarily by accuracy maximization — is substituted by a training process that promotes robustness, typically through the use of a different optimization objective that factors in the actions of a well-defined adversary.

In this article, we give a brief overview of adversarial attacks and adversarial robustness and summarize some recent attempts to mathematically understand the process of robust training. Adversarial training and its mathematical foundations are active areas of research and a thorough review of its extant literature is beyond the scope of this article.⁠Footnote1 For this reason, we will focus our discussion around some important lines of research in the theory of adversarial robustness, some of which are based on our own research work in the field, which takes a distinctive analytic and geometric perspective. One of our goals is to convey the mathematical richness of the field and discuss some of the many opportunities that are available for mathematicians to contribute to the development and understanding of this important applied problem.


AMS Notices limits to the references per article; we refer to the references cited here for further pointers to the literature.

1.1. Basics of training learning models

Data classification or regression typically occurs over a product space of the form . Here is the data space or feature space, an abstract metric space containing the data points, while is the set of labels, usually a finite set for classification tasks or the real line for regression. In the remainder, we mostly focus our discussion on the classification problem. There, the goal is to construct a function that accurately partitions the data space into the possible classes contained in . The learner does this by searching for a function or where is the probability simplex with vertices . If we write , then each represents the learner’s confidence that the data point belongs to class . While a probabilistic classifier is typically the desired output of a learning task, note that one can always obtain a deterministic classifier by selecting the largest entry of any tuple.

To train a machine learning system, one typically needs a finite training set consisting of data pairs , i.e., feature vectors with their associated ground truth classification. One then minimizes a loss function over some chosen function space with the goal of finding a function that produces an accurate classification of the data. For instance, may be the space of all neural networks with a certain architecture, while the loss function typically has the form

where is a function that is small when gives high probability to the ground truth label , and large otherwise. In practice, it may only be possible to find a classifier that is a local minimizer of 1 over , though it is often possible to drive the loss function to nearly zero during training via stochastic gradient descent. Either way, well-trained classifiers typically perform well—at least in the absence of adversarial attacks.

1.2. Adversarial attacks

Given a trained classifier and a data point , what is the best way for an adversary to perturb to produce an incorrect classification? In order for this to be a nontrivial question, we must assume that there are some restrictions on how far the adversary can perturb . This restriction is known as the adversarial budget, and it plays a crucial role in both adversarial attacks and robust training. For our purposes, we will formulate the adversarial budget through a parameter and assume that the adversary can only produce a perturbed data point that lies in , the ball of radius centered at the original data point .

Returning to our question, if the adversary has full access to the function and knows that the correct label for is , then the most powerful attack for a given budget is found by replacing with any point satisfying

In practice, the adversary may not be able to find a point that exactly satisfies 2. However, when is a subspace of Euclidean space, a simpler approach that produces highly effective attacks is to perturb the data in the direction of steepest ascent for the loss function by choosing

or by considering the popular PGD attack

where denotes the coordinatewise sign of its input; see GSS14MMS18, and TT22 for a motivation for the PGD attack.

Regardless of how the adversary chooses its attack, there are two key takeaways from formulas 2, 3, and 4 that we would like to highlight. Firstly, we see that adversarial attacks are found by attempting to maximize the loss function with respect to data perturbations. In contrast, the learner trains the classifier by attempting to minimize the loss function among classifiers belonging to a chosen function space (typically a parametric family). Hence, the learner and adversary can be viewed as playing a two-player game where they compete to set the value of the loss function using the tools at their disposal (the learner first gets to choose , the adversary then gets to modify data); this connection to game theory will become more important shortly. Secondly, it should be clear from formulas 2, 3, and 4 that the effectiveness of adversarial attacks must stem from a certain lack of smoothness in the trained classifier . Indeed, if were say 1-Lipschitz, then an adversary with budget could not change the classification probabilities at any point by more than . Thus, attacks that can fool an image classifier by making human-imperceptible changes to pixel values must be exploiting a significant lack of regularity.

1.3. Adversarial training

In light of the above considerations, to stave off adversarial attacks one must find a way to construct classifiers with better regularity properties. The most classical (and perhaps most obvious) way to do this would be to replace the training objective 1 with a new objective

where is a term that promotes regularity. For instance, could constrain the Lipschitz constant of or could be some other gradient penalty term. This approach has a long history of success in inverse problems (in that setting one typically adds a regularizing term to a data-fitting term to help mitigate the effect of noise), however, in the context of machine learning, it is often too difficult to efficiently minimize 5. For instance, it is very difficult to train a neural network with a Lipschitz constant constraint. On the other hand, popular and computationally feasible regularization terms for neural network training, for instance, weight regularization, do not seem to provide any defense against adversarial attacks.

Adversarial training is a different approach to regularization/robustification that has become very popular in the machine learning community. In adversarial training, rather than modifying the training objective with a regularizing term, one instead incorporates the adversary into the training process SZS14MMS18. More precisely, adversarial training replaces the objective 1 with

where the adversarial budget is chosen by the user. When training using 6, the learner is forced to find a function that cannot be easily attacked by an adversary with budget . The advantages of training using 6 compared to 5 are conceptual and computational. Conceptual, because in the formulation 6 one explicitly trains to defend against a well-defined adversary (although in practice this requires “understanding the enemy”). Computational, because 6 is in essence a min-max problem (the adversary maximizes over -perturbations while the learner minimizes by altering ) for which many implementable algorithms exist (for instance alternating gradient descent and ascent steps). Furthermore, the regularizing effect of 6 is data-dependent in contrast to the regularization induced by a standard gradient penalty term which has very little connection to the structure of the data.

On the other hand, compared to 5, it is harder to understand (analytically and geometrically) how exactly 6 is regularizing/robustifying the classifier ; see the discussion in TT22 and references therein. Furthermore, it is not so clear how the user should choose the budget parameter and be mindful of the tradeoff between accuracy (on clean data) and robustness that problem 6 introduces. Answering these questions in full generality is challenging and remains an open problem in the field.

1.4. Outline

To fix some ideas, we write 6 in the general form:

where is an arbitrary probability measure over , the “clean data distribution,” and not just an empirical measure as in 6. We work with arbitrary to avoid distinguishing, whenever unnecessary, between population and finite data settings. Only in some parts of Section 2 will it be important to make assumptions on .

In this paper we explore the following general questions:


What type of regularization is enforced on learning models by the presence of adversaries?


What are the tradeoffs between accuracy and robustness when training models robustly?


How can one actually train models to be robust to specific adversarial attacks?


How can one compute meaningful lower bounds for the (AT) problem?

The above questions are too broad to be answered in complete generality, and in the remainder, we will focus on specific settings where we can reveal interesting geometric and analytic structures. In particular, in Section 2, we explore the type of regularization enforced on binary classifiers, revealing a connection between adversarial training and perimeter minimization problems. This connection will allow us to interpret geometrically the tradeoff between accuracy and robustness. In Section 3, we discuss a concrete game-theoretic interpretation of adversarial training and discuss a more general framework for adversarial robustness based on distributionally robust optimization (DRO). We use this connection with game theory to discuss some potential strategies for training robust learning models and highlight the significance of the concept of Nash equilibrium for adversarial training. Finally, in Section 4, we discuss how an agnostic learner setting can be used to derive lower bounds for more general (AT) problems. We show that in the agnostic learner setting for multiclass classification the adversarial robustness objective can be equivalently rewritten as the geometric problem of finding a (generalized) barycenter of a collection of measures and then discuss the computational implications of this equivalence. We wrap up the paper in Section 5 by discussing some research directions connected to the topics presented throughout the paper.

1.4.1. Additional notation

When working on we will often consider balls associated to a given norm on . We use to denote the dual norm of , which is defined according to

We will denote by the space of Borel probability measures over the set . Given two measures we denote by the space of couplings between and , i.e., probability measures over whose first and second marginals are, respectively, and . We will also use the notion of a pushforward of a measure by a map. Precisely, if is a measurable map between two measurable spaces, and is a probability measure over , we define , the pushforward of by , to be the measure on for which for all measurable subsets of .

2. Adversarial Robustness: Regularization and Perimeter

In this section, we discuss the connection between adversarial training and explicit regularization methods. To motivate this connection, let us first consider a simple robust linear regression setting. In this setting, models in the family take the form

where is some subset of and . Here, the learner’s goal is to select a linear regression function relating inputs to real-valued outputs . We show the following equivalence between problem AT and an explicit regularization problem taking the form of a Lasso-type linear regression:

note that in order to be consistent with the standard definition of the Lasso regularization, we would require to be the -norm to get to be the -norm. Identity 7 is only one of many similar identities relating regularization methods and adversarially robust learning problems for classical families of statistical models. The extent of this type of equivalences is more apparent when considering DRO versions (see Section 3 for a definition) of the adversarial training problem, e.g., see BKM19, where in addition some statistical inference methodologies, motivated by these equivalences, are proposed.

To deduce 7, it is enough to consider the optimization problem at every fixed and realize that the sup can be written as either when , or as when ; see Figure 1 for an illustration. Using the definition of the dual norm one can deduce that in all cases this expression is equal to , from which 7 follows.

Figure 1.

ball around of radius crossed by level sets of function . The value is realized at either or .

Graphic without alt text

Equivalence 7, although limited to the linear regression setting, motivates exploring the regularization effect of adversaries on more general families of learning models. In the remainder of this section we discuss how in the binary classification setting this regularization effect can be related to geometric properties of decision boundaries, in particular to their size or curvature. By presenting this analysis we hope to convey that the connection between adversarial robustness and regularization methods goes beyond simple classical statistical settings, in turn revealing a variety of interesting geometric problems motivated by machine learning.

2.1. Perimeter regularization

Let us consider a binary classification version of AT where , is the - loss (i.e., if the two inputs of are the same, and otherwise), and is a family of binary classifiers for a family of measurable subsets of . Here denotes the indicator function of a subset of , defined according to

Notice that we can “parameterize” a family of binary classifiers with a family of subsets in without losing any generality, due to the fact that binary classifiers output one of two values, or , and thus can be characterized by the regions of points in that they classify as a .

In general, as shown in BGTM23, Problem AT is equivalent to a regularization problem, with a non-local perimeter regularizer, of the form:


Figure 2 illustrates the sets and . In the above, the measures and are the measures over defined according to and , i.e., up to scaling factors they are the conditional distributions of the variable given the possible values that may take. The equivalence between AT and 8 can be deduced, at least at a formal level, by adding and subtracting the term from the term and then identifying the resulting terms with those in 9. To make these computations rigorous and to show existence of solutions to problem 9, there are several technical challenges, beginning with the measurability of the operations involved in defining the problem AT, that must be overcome; the first part of the work BGTM23 discusses some of these challenges.

Now, let us motivate the use of the word perimeter when describing the functional . Suppose that and that the measures and are absolutely continuous with respect to the Lebesgue measure so that we can write them as and for two non-negative Lebesgue-integrable functions and that for simplicity will be assumed to be smooth. In this case, 8 can be rewritten as

Notice that the sets in the volume integrals shrink toward , the boundary of , as we send . Moreover, due to the rescaling factor in front of these integrals, one may anticipate a connection between and the more classical notion of (weighted) perimeter:

where . Note that is precisely the decision boundary between classes and according to the classifier and that is the density, with respect to the Lebesgue measure, of the marginal of the data distribution on the variable. In the definition of we have used , the dimensional Hausdorff measure, which can be used to measure the size of a hypersurface of codimension one. In what follows we discuss two different ways to understand the relationship between and .

One first possible way to relate and is through a pointwise convergence analysis: fix a set with regular enough boundary and then study the behavior of as . This is the type of analysis discussed in GTM22, which was used by the authors to motivate the connection between adversarial robustness in binary classification and geometric variational problems involving perimeter. However, pointwise convergence of toward is not sufficient to ensure that induces a perimeter regularization type effect on its minimizers. For that, we need a different type of convergence.

A different way to compare the functionals and is through a variational analysis; we refer the interested reader to the recent paper BS22, which explains in detail this type of convergence and shows that converges variationally toward , at least when balls are induced by the Euclidean distance. Here we restrict ourselves to discussing some of the mathematical implications of the analysis in BS22, which considers problem 9 when is the set of all (Borel) measurable subsets of ; this setting corresponds to an agnostic learner setting.

First, BS22 shows that minimizers of 8 converge, as , toward minimizers of the problem

where . This means that, as , solutions to the adversarial training problem select among minimizers to the unrobust risk the ones with minimal perimeter. In particular, this result helps capture the idea that when is small, the presence of an adversary has the same effect as imposing a perimeter penalization term on the classifiers; c.f. GTM22 for more discussion on this idea and on the relation with mean curvature flows.

A second consequence of the results in BS22 is an expansion in for the adversarial risk , i.e., the minimum value of AT. Precisely,

where denotes the minimum value in 10. The above result is reminiscent to Danskin’s theorem for functions over Euclidean space, a theorem that is used to characterize the first “derivative” of a function defined as the infimum over a family of functions. Naturally, the difficulty in proving a result like 11 lies in the fact that the functionals of interest take subsets of as input, and thus Danskin’s theorem cannot be applied. Formula 11 can be used to give a geometric interpretation to the rate at which accuracy is lost when building robust classifiers as a function of the adversarial budget . Indeed, this result says that for small , accuracy is lost at the rate given by the Bayes classifier with minimal perimeter. The trade-off between accuracy and robustness has been investigated from statistical perspectives in ZYJ1909, for example. In contrast, the tools and concepts discussed here have a geometric and analytic flavor, with the caveat that they are only meaningful for a population-level analysis of adversarial robustness in binary classification. To gain an even better understanding of the situation, higher-order expansions of the adversarial risk in would be desirable and are a current topic of investigation.

Figure 2.

In green, the set of points in within distance from the boundary of ; in blue, the set of points in within distance from the boundary. The union of and is the region where the adversary may attack and guarantee a mismatch between predicted and true labels.

Graphic without alt text

2.2. Certifiability and regularity

In this section, we discuss notions of certifiability and regularity of robust binary classifiers. We begin with a definition.

Definition 2.1.

Let be a binary classifier. We say that is -certifiable (for ) if or if .

In simple terms, the certifiable points of a given classifier are the points in for which the classification rule stays constant within the ball of radius around them: they are the points that are insensitive to the adversarial attacks in the adversarial problem AT.

While it is not possible to build nontrivial sets for which all points in are certifiable, we can still ask whether it is possible to find robust classifiers that are fully characterized by their certifiable points. This motivates the following definitions.

Definition 2.2.

We say that a measurable set is inner regular if for all there exists such that and . Likewise, we say that is outer regular, if is inner regular. Sets that are both inner and outer regular will be referred to as pseudo-certifiable.

Notice that a classifier that is pseudo-certifiable is completely determined by its outputs on its certifiable points. Pseudo-certifiability is thus a desirable property. It is then natural to wonder whether it is always possible to construct an pseudo-certifiable classifier minimizing the adversarial risk, i.e., a set solving AT when the class of sets is . As it turns out, the notion of pseudo-certifiability is very strong, and in the case of the Euclidean distance, for example, it implies that decision boundaries are locally the graph of a function; see BGTM23 and references therein. An example of a setting where no optimal robust classifier is pseudo-certifiable is given in Figure 3. There, is the sum of four delta measures in at the points , two red and two purple. Any optimal classifier must stay constant within the -balls centered at each point. The color choice in the shaded blue region does not affect optimality, however, there is no way to color this region and maintain -inner regularity for both sets (c.f. Figure 3).

Figure 3.

A data set for which no optimal robust classifier is pseudo-certifiable. If is in the blue region and for some ball of radius , then must intersect both a red and purple circle.

Graphic without alt text

While we cannot guarantee pseudo-certifiability for robust classifiers in general, we can still guarantee existence of -inner regular solutions, -outer regular solutions, and sometimes solutions with other forms of regularity. This is the content of a series of results in BGTM23 stated informally below.

Theorem 2.3 (Informal from BGTM23).

Let be an arbitrary probability measure over , and let be the class of all Borel measurable classifiers. Let be any solution to the (AT) problem. Then there exist two solutions , to AT such that , and


is inner-regular and is outer-regular.


Any measurable set satisfying is a solution to (AT).

Moreover, if is an Euclidean space and the balls are induced by the Euclidean distance, then there exists a solution to AT such that the boundary of is locally the graph of a function.

Figure 4.

A set that is -inner regular (some inwards cusps allowed), a set that is -outer regular (some outward cusps allowed), and a smooth set in between. In the context of Theorem 2.3 for Euclidean balls, the set would be a solution to AT.

Graphic without alt text

The analysis in BGTM23 also provides quantitative estimates for the regularity of the decision boundary of the classifier in the last part of Theorem 2.3. In general, these estimates blow up when . It is however expected that one could get finer regularity estimates under additional assumptions on , e.g., assuming that , , and the set is sufficiently regular. Obtaining these finer estimates and characterizing the needed regularity for these finer estimates to apply are topics of current investigation.

3. Connections to Game Theory: DRO Formulations of AT

In this section, we introduce a framework for adversarial training that encompasses AT and that can be cast more precisely within game theory. In particular, in this larger framework we will be able to discuss the notion of Nash equilibrium in adversarial training and consider its implications on the robust training of learning models.

The idea is as follows. Instead of considering pointwise attacks as in AT, where for every single data point the adversary proposes an attack, we allow the adversary to modify by producing an entirely new data distribution . Naturally, as in model AT, the adversary must pay a price (or use a budget) for carrying out this modification. In precise terms, we consider the following families of problems:


The above are two instances of distributionally robust optimization problems given their inner maximization over probability measures. Notice that, in general, problem 12 can be written as 13 by defining the cost to be if the explicit constraint is satisfied and infinity otherwise. Both and can be interpreted as “distances” between probability distributions.

In the remainder of the paper, we will restrict our attention to problem 13 with a cost function taking the form of an optimal transport problem:

for a cost function that describes the marginal cost that the adversary must pay in order to move a clean data point to a new location (recall that is the space of probability measures on with first marginal and second marginal ). Notice that, in this generality, the adversary has the ability to modify both the feature vector and the label .

Some natural examples of cost functions are , a choice that is particularly meaningful when and ; here, is a positive constant that can be interpreted as reciprocal to an adversarial budget. Another example of cost function of interest is 15 below, which can be used to rewrite problem AT in the form 13. Problem 13 is thus a rather general mathematical formulation for adversarial training.

Proposition 3.1 (Informal).

Problem AT is equivalent to problem 13 for a cost function of the form 14 with marginal cost