
# An Analytical and Geometric Perspective on Adversarial Robustness

Communicated by *Notices* Associate Editor Daniela De Silva

…and then they went at each other, as if they were two mortal enemies.

## 1. Introduction

In the last ten years, neural networks have made incredible strides in classifying large data sets, to the point that they can now outperform humans in raw accuracy. However, the robustness of these systems is a completely different story. Suppose you were asked to identify whether a photo contained an image of a cat or a dog. You probably would have no difficulty at all; at worst, maybe you would only be tripped up by a particularly small or unusual Shiba Inu. In contrast, it has been widely documented that an adversary can convince an otherwise well-performing neural network that a dog is actually a cat (or vice versa) by making tiny human-imperceptible changes to an image at the pixel level. These small perturbations are known as *adversarial attacks* and they are a significant obstacle to the deployment of machine learning systems in security-critical applications [GSS14]. The susceptibility to adversarial attacks is not exclusive to neural network models, and many other learning systems have also been observed to be brittle when facing adversarial perturbations of data.

The business of defending against adversarial attacks is known as *adversarial robustness*, *robust training*, or simply *adversarial training* (although we will mostly reserve the latter name for a specific optimization objective). There are many methods in the literature that can be used to build defenses against adversarial attacks, but here we will be particularly interested in methods that enforce robustness *during model training*. In these types of methods, the standard training process — driven primarily by accuracy maximization — is substituted by a training process that promotes robustness, typically through the use of a different optimization objective that factors in the actions of a well-defined adversary.

In this article, we give a brief overview of adversarial attacks and adversarial robustness and summarize some recent attempts to mathematically understand the process of robust training. Adversarial training and its mathematical foundations are active areas of research, and a thorough review of the extant literature is beyond the scope of this article.^{1} For this reason, we will focus our discussion on some important lines of research in the theory of adversarial robustness, some of which are based on our own research work in the field, which takes a distinctive analytic and geometric perspective. One of our goals is to convey the mathematical richness of the field and discuss some of the many opportunities that are available for mathematicians to contribute to the development and understanding of this important applied problem.

^{1}

AMS Notices limits the number of references per article; we refer to the references cited here for further pointers to the literature.

### 1.1. Basics of training learning models

Data classification or regression typically occurs over a product space of the form $\mathcal{X} \times \mathcal{Y}$. Here $\mathcal{X}$ is the *data space* or *feature space*, an abstract metric space containing the data points, while $\mathcal{Y}$ is the set of *labels*, usually a finite set $\{1, \dots, K\}$ for classification tasks or the real line for regression. In the remainder, we mostly focus our discussion on the classification problem. There, the goal is to construct a function that accurately partitions the data space into the possible classes contained in $\mathcal{Y}$. The learner does this by searching for a function $f : \mathcal{X} \to \mathcal{Y}$ or $f : \mathcal{X} \to \Delta_K$, where $\Delta_K$ is the probability simplex with $K$ vertices. If we write $f = (f_1, \dots, f_K)$, then each $f_i(x)$ represents the learner’s confidence that the data point $x$ belongs to class $i$. While a probabilistic classifier is typically the desired output of a learning task, note that one can always obtain a deterministic classifier $\mathcal{X} \to \mathcal{Y}$ by selecting the largest entry of any tuple $f(x)$.

To train a machine learning system, one typically needs a finite training set $\{(x_1, y_1), \dots, (x_n, y_n)\}$ consisting of data pairs, i.e., feature vectors with their associated ground-truth classification. One then minimizes a *loss function* over some chosen function space $\mathcal{F}$ with the goal of finding a function that produces an accurate classification of the data. For instance, $\mathcal{F}$ may be the space of all neural networks with a certain architecture, while the loss function typically has the form

$$\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big), \tag{1}$$

where $\ell$ is a function that is small when $f(x_i)$ gives high probability to the ground truth label $y_i$ and large otherwise. In practice, it may only be possible to find a classifier $f \in \mathcal{F}$ that is a local minimizer of (1) over $\mathcal{F}$, though it is often possible to drive the loss function to nearly zero during training via stochastic gradient descent. Either way, well-trained classifiers typically perform well—at least in the absence of adversarial attacks.
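To make the training pipeline concrete, here is a minimal numpy sketch of empirical risk minimization for a logistic (two-class probabilistic) classifier trained by gradient descent. The data set, learning rate, and iteration count are illustrative choices, not from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: two well-separated Gaussian blobs in R^2, labels 0 and 1.
X = np.vstack([rng.normal(-1.0, 0.7, (50, 2)), rng.normal(1.0, 0.7, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def empirical_risk(w, b):
    # Average cross-entropy loss over the training pairs, as in (1).
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

# Plain (full-batch) gradient descent on the empirical risk.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

train_accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
```

In practice one would use stochastic gradients and a richer model class, but the structure of the optimization is the same.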

### 1.2. Adversarial attacks

Given a trained classifier $f$ and a data point $x$ with label $y$, what is the best way for an adversary to perturb $x$ to produce an incorrect classification? In order for this to be a nontrivial question, we must assume that there are some restrictions on how far the adversary can perturb $x$. This restriction is known as the *adversarial budget*, and it plays a crucial role in both adversarial attacks and robust training. For our purposes, we will formulate the adversarial budget through a parameter $\varepsilon > 0$ and assume that the adversary can only produce a perturbed data point $\tilde{x}$ that lies in the ball $B(x, \varepsilon)$ of radius $\varepsilon$ centered at the original data point $x$.

Returning to our question, if the adversary has full access to the function $f$ and knows that the correct label for $x$ is $y$, then the most powerful attack for a given budget $\varepsilon$ is found by replacing $x$ with any point $\tilde{x}$ satisfying

$$\tilde{x} \in \operatorname*{argmax}_{x' \in B(x, \varepsilon)} \ell\big(f(x'), y\big). \tag{2}$$

In practice, the adversary may not be able to find a point that exactly satisfies (2). However, when $\mathcal{X}$ is a subspace of Euclidean space, a simpler approach that produces highly effective attacks is to perturb the data in the direction of steepest ascent for the loss function by choosing

$$\tilde{x} = x + \varepsilon\, \frac{\nabla_x \ell\big(f(x), y\big)}{\big\| \nabla_x \ell\big(f(x), y\big) \big\|}, \tag{3}$$

or by considering the popular *PGD* attack

$$\tilde{x} = x + \varepsilon\, \operatorname{sign}\!\big(\nabla_x \ell\big(f(x), y\big)\big), \tag{4}$$

where $\operatorname{sign}(\cdot)$ denotes the coordinatewise sign of its input; see [GSS14], [MMS18], and [TT22] for a motivation for the PGD attack.
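The attack formulas above can be illustrated on a toy differentiable model. The sketch below computes a one-step coordinatewise-sign attack and an iterated variant that projects back onto the $\ell_\infty$ ball after each ascent step (the iterated, projected form is how PGD is usually run in practice); the linear model, clean point, and budget are hypothetical choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def point_loss(w, b, x, y):
    # Cross-entropy of a logistic model at a single labeled point (x, y).
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad_x(w, b, x, y):
    # Gradient of the loss with respect to the *input* x (not the weights).
    return (sigmoid(w @ x + b) - y) * w

# A hypothetical trained linear classifier and a clean point with label 1.
w, b = np.array([2.0, -1.0]), 0.0
x, y = np.array([0.5, 0.2]), 1
eps = 0.1  # adversarial budget for the l_inf ball

# One-step sign attack: move eps in the coordinatewise-sign direction.
x_sign = x + eps * np.sign(grad_x(w, b, x, y))

# Iterated variant: smaller ascent steps, each followed by projection
# back onto the l_inf ball of radius eps around the clean point.
x_pgd = x.copy()
for _ in range(10):
    x_pgd = x_pgd + 0.03 * np.sign(grad_x(w, b, x_pgd, y))
    x_pgd = np.clip(x_pgd, x - eps, x + eps)

clean_loss = point_loss(w, b, x, y)
sign_loss = point_loss(w, b, x_sign, y)
pgd_loss = point_loss(w, b, x_pgd, y)
```

Both attacks stay inside the budget and strictly increase the loss at this point, which is exactly the adversary's goal.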

Regardless of how the adversary chooses its attack, there are two key takeaways from formulas (2), (3), and (4) that we would like to highlight. Firstly, we see that adversarial attacks are found by attempting to maximize the loss function with respect to data perturbations. In contrast, the learner trains the classifier by attempting to minimize the loss function among classifiers belonging to a chosen function space (typically a parametric family). Hence, the learner and adversary can be viewed as playing a two-player game where they compete to set the value of the loss function using the tools at their disposal (the learner first gets to choose $f$; the adversary then gets to modify the data); this connection to game theory will become more important shortly. Secondly, it should be clear from formulas (2), (3), and (4) that the effectiveness of adversarial attacks must stem from a certain lack of smoothness in the trained classifier $f$. Indeed, if $f$ were, say, 1-Lipschitz, then an adversary with budget $\varepsilon$ could not change the classification probabilities at any point by more than $\varepsilon$. Thus, attacks that can fool an image classifier by making human-imperceptible changes to pixel values must be exploiting a significant lack of regularity.

### 1.3. Adversarial training

In light of the above considerations, to stave off adversarial attacks one must find a way to construct classifiers with better regularity properties. The most classical (and perhaps most obvious) way to do this would be to replace the training objective (1) with a new objective

$$\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big) + R(f), \tag{5}$$

where $R(f)$ is a term that promotes regularity. For instance, $R$ could constrain the Lipschitz constant of $f$, or it could be some other gradient penalty term. This approach has a long history of success in inverse problems (in that setting one typically adds a regularizing term to a data-fitting term to help mitigate the effect of noise); however, in the context of machine learning, it is often too difficult to efficiently minimize (5). For instance, it is very difficult to train a neural network with a Lipschitz constant constraint. On the other hand, popular and computationally feasible regularization terms for neural network training, for instance weight regularization, do not seem to provide any defense against adversarial attacks.

Adversarial training is a different approach to regularization/robustification that has become very popular in the machine learning community. In adversarial training, rather than modifying the training objective with a regularizing term, one instead incorporates the adversary into the training process [SZS14], [MMS18]. More precisely, adversarial training replaces the objective (1) with

$$\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \sup_{\tilde{x}_i \in B(x_i, \varepsilon)} \ell\big(f(\tilde{x}_i), y_i\big), \tag{6}$$

where the adversarial budget $\varepsilon$ is chosen by the user. When training using (6), the learner is forced to find a function that cannot be easily attacked by an adversary with budget $\varepsilon$. The advantages of training using (6) compared to (5) are conceptual and computational. Conceptual, because in the formulation (6) one explicitly trains to defend against a well-defined adversary (although in practice this requires “understanding the enemy”). Computational, because (6) is in essence a min-max problem (the adversary maximizes over $\varepsilon$-perturbations while the learner minimizes by altering $f$) for which many implementable algorithms exist (for instance, alternating gradient descent and ascent steps). Furthermore, the regularizing effect of (6) is data-dependent, in contrast to the regularization induced by a standard gradient penalty term, which has very little connection to the structure of the data.
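As a concrete instance of this min-max structure, the following numpy sketch alternates an approximate inner maximization (a one-step sign attack on every training point) with an outer gradient descent step for a logistic model. All numeric choices (data, budget, step sizes) are hypothetical, and the one-step attack is only a crude stand-in for the inner supremum.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: two Gaussian blobs in R^2, labels 0 and 1.
X = np.vstack([rng.normal(-1.0, 0.8, (60, 2)), rng.normal(1.0, 0.8, (60, 2))])
y = np.concatenate([np.zeros(60), np.ones(60)])
eps = 0.3  # adversarial budget (l_inf), a hypothetical user choice

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.zeros(2), 0.0
for _ in range(300):
    # Inner maximization (approximate): one sign-gradient step of size eps
    # attacks every training point; the input-gradient of the per-sample
    # loss for this model is (p_i - y_i) * w.
    p = sigmoid(X @ w + b)
    X_adv = X + eps * np.sign(np.outer(p - y, w))
    # Outer minimization: a gradient descent step on the loss evaluated
    # at the attacked points (alternating descent/ascent, as described).
    p_adv = sigmoid(X_adv @ w + b)
    w -= 0.5 * (X_adv.T @ (p_adv - y)) / len(y)
    b -= 0.5 * np.mean(p_adv - y)

clean_accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
```

The trained model keeps good clean accuracy on this easy problem while being optimized against perturbed inputs, illustrating the accuracy/robustness interplay discussed next.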

On the other hand, compared to (5), it is harder to understand (analytically and geometrically) how exactly (6) is regularizing/robustifying the classifier; see the discussion in [TT22] and references therein. Furthermore, it is not so clear how the user should choose the budget parameter $\varepsilon$ and be mindful of the tradeoff between accuracy (on clean data) and robustness that problem (6) introduces. Answering these questions in full generality is challenging and remains an open problem in the field.

### 1.4. Outline

To fix some ideas, we write (6) in the general form:

$$\min_{f \in \mathcal{F}} \; \mathbb{E}_{(x, y) \sim \mu}\Big[ \sup_{\tilde{x} \in B(x, \varepsilon)} \ell\big(f(\tilde{x}), y\big) \Big], \tag{AT}$$

where $\mu$ is an arbitrary probability measure playing the role of the “clean data distribution,” and not just an empirical measure as in (6). We work with arbitrary $\mu$ to avoid distinguishing, whenever unnecessary, between population and finite data settings. Only in some parts of Section 2 will it be important to make assumptions on $\mu$.

In this paper we explore the following general questions:

1. What type of regularization is enforced on learning models by the presence of adversaries?

2. What are the tradeoffs between accuracy and robustness when training models robustly?

3. How can one actually train models to be robust to specific adversarial attacks?

4. How can one compute meaningful lower bounds for the (AT) problem?

The above questions are too broad to be answered in complete generality, and in the remainder, we will focus on specific settings where we can reveal interesting geometric and analytic structures. In particular, in Section 2, we explore the type of regularization enforced on binary classifiers, revealing a connection between adversarial training and perimeter minimization problems. This connection will allow us to interpret geometrically the tradeoff between accuracy and robustness. In Section 3, we discuss a concrete game-theoretic interpretation of adversarial training and discuss a more general framework for adversarial robustness based on *distributionally robust optimization* (DRO). We use this connection with game theory to discuss some potential strategies for training robust learning models and highlight the significance of the concept of Nash equilibrium for adversarial training. Finally, in Section 4, we discuss how an *agnostic learner* setting can be used to derive lower bounds for more general (AT) problems. We show that in the agnostic learner setting for multiclass classification the adversarial robustness objective can be equivalently rewritten as the geometric problem of finding a (generalized) barycenter of a collection of measures and then discuss the computational implications of this equivalence. We wrap up the paper in Section 5 by discussing some research directions connected to the topics presented throughout the paper.

#### 1.4.1. Additional notation

When working on $\mathbb{R}^d$ we will often consider balls associated to a given norm $\|\cdot\|$ on $\mathbb{R}^d$. We use $\|\cdot\|_*$ to denote the dual norm of $\|\cdot\|$, which is defined according to

$$\|z\|_* := \sup_{\|v\| \le 1} \langle z, v \rangle.$$

We will denote by $\mathcal{P}(\mathcal{Z})$ the space of Borel probability measures over the set $\mathcal{Z}$. Given two measures $\mu, \tilde{\mu} \in \mathcal{P}(\mathcal{Z})$, we denote by $\Gamma(\mu, \tilde{\mu})$ the space of couplings between $\mu$ and $\tilde{\mu}$, i.e., probability measures over $\mathcal{Z} \times \mathcal{Z}$ whose first and second marginals are, respectively, $\mu$ and $\tilde{\mu}$. We will also use the notion of a *pushforward* of a measure by a map. Precisely, if $T : \mathcal{Z}_1 \to \mathcal{Z}_2$ is a measurable map between two measurable spaces, and $\mu$ is a probability measure over $\mathcal{Z}_1$, we define $T_\sharp \mu$, the pushforward of $\mu$ by $T$, to be the measure on $\mathcal{Z}_2$ for which $T_\sharp \mu(A) = \mu(T^{-1}(A))$ for all measurable subsets $A$ of $\mathcal{Z}_2$.
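As a quick numerical illustration of the dual-norm definition, the sketch below estimates the dual of the $\ell_1$ norm by maximizing $\langle z, v \rangle$ over the $\ell_1$ unit ball and compares the result with the $\ell_\infty$ norm, which is the known dual of $\ell_1$. The sampling scheme and dimensions are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(size=4)

# Estimate ||z||_* = sup_{||v||_1 <= 1} <z, v> for the l_1 norm by sampling
# points on the l_1 unit sphere; the vertices +/- e_i, where the supremum
# is attained, are included explicitly.
dirs = rng.normal(size=(5000, 4))
on_sphere = dirs / np.sum(np.abs(dirs), axis=1, keepdims=True)
vertices = np.vstack([np.eye(4), -np.eye(4)])
candidates = np.vstack([on_sphere, vertices])

dual_norm_estimate = np.max(candidates @ z)
linf_norm = np.max(np.abs(z))  # the dual of l_1 is l_inf
```

By Hölder's inequality no candidate can exceed $\|z\|_\infty$, and the signed coordinate vector at the largest entry of $z$ attains it exactly.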

## 2. Adversarial Robustness: Regularization and Perimeter

In this section, we discuss the connection between adversarial training and explicit regularization methods. To motivate this connection, let us first consider a simple robust linear regression setting. In this setting, models in the family $\mathcal{F}$ take the form

$$f_{w,b}(x) = w \cdot x + b,$$

where $(w, b)$ belongs to some subset of $\mathbb{R}^d \times \mathbb{R}$. Here, the learner’s goal is to select a linear regression function relating inputs $x$ to real-valued outputs $y$. We show the following equivalence between problem (AT) and an explicit regularization problem taking the form of a Lasso-type linear regression:

$$\min_{w, b} \; \mathbb{E}_{(x, y) \sim \mu}\Big[ \sup_{\tilde{x} \in B(x, \varepsilon)} \big| w \cdot \tilde{x} + b - y \big| \Big] \;=\; \min_{w, b} \; \Big\{ \mathbb{E}_{(x, y) \sim \mu}\big[ \, \big| w \cdot x + b - y \big| \, \big] + \varepsilon \|w\|_* \Big\}; \tag{7}$$

note that in order to be consistent with the standard definition of the Lasso regularization, we would require $\|\cdot\|$ to be the $\ell_\infty$-norm, so that the dual norm $\|\cdot\|_*$ is the $\ell_1$-norm. Identity (7) is only one of many similar identities relating regularization methods and adversarially robust learning problems for classical families of statistical models. The extent of this type of equivalence is more apparent when considering DRO versions (see Section 3 for a definition) of the adversarial training problem, e.g., see [BKM19], where in addition some statistical inference methodologies, motivated by these equivalences, are proposed.

To deduce (7), it is enough to consider the optimization problem at every fixed $(x, y)$ and realize that the sup can be written as either $w \cdot x + b - y + \sup_{\|\delta\| \le \varepsilon} w \cdot \delta$ when $w \cdot x + b - y \ge 0$, or as $y - w \cdot x - b + \sup_{\|\delta\| \le \varepsilon} (-w \cdot \delta)$ when $w \cdot x + b - y < 0$; see Figure 1 for an illustration. Using the definition of the dual norm, one can deduce that in all cases this expression is equal to $|w \cdot x + b - y| + \varepsilon \|w\|_*$, from which (7) follows.
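The computation just described can be checked numerically. The sketch below verifies, for the Euclidean norm (which is its own dual) and arbitrary data, that the worst-case residual over the $\varepsilon$-ball equals the clean residual plus $\varepsilon$ times the dual norm of $w$, and that the maximizer points along $\pm w$. The intercept $b$ is dropped and all numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.normal(size=3)
x = rng.normal(size=3)
y, eps = 0.7, 0.25

# Closed form from the dual-norm argument: for the (self-dual) Euclidean
# norm, sup over the eps-ball of |w.(x+delta) - y| is |w.x - y| + eps*||w||.
closed_form = abs(w @ x - y) + eps * np.linalg.norm(w)

# The maximizing perturbation points along +/- w / ||w||_2, with the sign
# chosen to enlarge the residual.
delta_star = eps * np.sign(w @ x - y) * w / np.linalg.norm(w)
attained = abs(w @ (x + delta_star) - y)

# Random perturbations inside the ball should never beat the closed form.
dirs = rng.normal(size=(2000, 3))
deltas = eps * dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
sampled_max = np.max(np.abs((x + deltas) @ w - y))
```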

Equivalence (7), although limited to the linear regression setting, motivates exploring the regularization effect of adversaries on more general families of learning models. In the remainder of this section we discuss how, in the *binary classification setting*, this regularization effect can be related to geometric properties of decision boundaries, in particular to their size or curvature. By presenting this analysis we hope to convey that the connection between adversarial robustness and regularization methods goes beyond simple classical statistical settings, in turn revealing a variety of interesting geometric problems motivated by machine learning.

### 2.1. Perimeter regularization

Let us consider a binary classification version of (AT) where $\mathcal{Y} = \{0, 1\}$, $\ell$ is the $0$-$1$ loss (i.e., $\ell$ is $0$ if the two inputs of $\ell$ are the same, and $1$ otherwise), and $\mathcal{F} = \{\mathbf{1}_A : A \in \mathfrak{A}\}$ is a family of binary classifiers for a family $\mathfrak{A}$ of measurable subsets of $\mathcal{X}$. Here $\mathbf{1}_A$ denotes the indicator function of a subset $A$ of $\mathcal{X}$, defined according to

$$\mathbf{1}_A(x) := \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{otherwise}. \end{cases}$$

Notice that we can “parameterize” a family of binary classifiers with a family of subsets in $\mathcal{X}$ without losing any generality, due to the fact that binary classifiers output one of two values, $0$ or $1$, and thus can be characterized by the regions of points in $\mathcal{X}$ that they classify as a $1$.

In general, as shown in [BGTM23], Problem (AT) is equivalent to a regularization problem, with a *non-local perimeter* regularizer, of the form:

$$\min_{A \in \mathfrak{A}} \; \mu_0(A) + \mu_1(A^c) + \varepsilon\, \mathrm{Per}_\varepsilon(A), \tag{8}$$

where

$$\mathrm{Per}_\varepsilon(A) := \frac{1}{\varepsilon}\Big( \mu_0\big(A^\varepsilon \setminus A\big) + \mu_1\big(A \setminus A^{-\varepsilon}\big) \Big), \qquad A^\varepsilon := \{x : \operatorname{dist}(x, A) \le \varepsilon\}, \quad A^{-\varepsilon} := \{x : B(x, \varepsilon) \subseteq A\}. \tag{9}$$

Figure 2 illustrates the sets $A^\varepsilon \setminus A$ and $A \setminus A^{-\varepsilon}$. In the above, the measures $\mu_0$ and $\mu_1$ are the measures over $\mathcal{X}$ defined according to $\mu_0(\cdot) := \mu(\cdot \times \{0\})$ and $\mu_1(\cdot) := \mu(\cdot \times \{1\})$, i.e., up to scaling factors they are the conditional distributions of the variable $x$ given the possible values that $y$ may take. The equivalence between (AT) and (8) can be deduced, at least at a formal level, by adding and subtracting the term $\mu_0(A) + \mu_1(A^c)$ from the adversarial risk and then identifying the resulting terms with those in (9). To make these computations rigorous and to show existence of solutions to problem (8), there are several technical challenges, beginning with the measurability of the operations involved in defining the problem (AT), that must be overcome; the first part of the work [BGTM23] discusses some of these challenges.

Now, let us motivate the use of the word *perimeter* when describing the functional $\mathrm{Per}_\varepsilon$. Suppose that $\mathcal{X} = \mathbb{R}^d$ and that the measures $\mu_0$ and $\mu_1$ are absolutely continuous with respect to the Lebesgue measure, so that we can write them as $d\mu_0 = \rho_0\, dx$ and $d\mu_1 = \rho_1\, dx$ for two non-negative Lebesgue-integrable functions $\rho_0$ and $\rho_1$ that for simplicity will be assumed to be smooth. In this case, (9) can be rewritten as

$$\mathrm{Per}_\varepsilon(A) = \frac{1}{\varepsilon} \int_{A^\varepsilon \setminus A} \rho_0(x)\, dx + \frac{1}{\varepsilon} \int_{A \setminus A^{-\varepsilon}} \rho_1(x)\, dx.$$

Notice that the sets in the volume integrals shrink toward the boundary of $A$ as we send $\varepsilon \to 0$. Moreover, due to the rescaling factor $1/\varepsilon$ in front of these integrals, one may anticipate a connection between $\mathrm{Per}_\varepsilon$ and the more classical notion of (weighted) perimeter:

$$\mathrm{Per}(A) := \int_{\partial A} \rho(x)\, d\mathcal{H}^{d-1}(x),$$

where $\rho := \rho_0 + \rho_1$. Note that $\partial A$ is precisely the decision boundary between classes $0$ and $1$ according to the classifier $\mathbf{1}_A$, and that $\rho$ is the density, with respect to the Lebesgue measure, of the marginal of the data distribution on the $x$ variable. In the definition of $\mathrm{Per}$ we have used $\mathcal{H}^{d-1}$, the $(d-1)$-dimensional Hausdorff measure, which can be used to measure the size of a hypersurface of codimension one. In what follows we discuss two different ways to understand the relationship between $\mathrm{Per}_\varepsilon$ and $\mathrm{Per}$.
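To see this convergence in the simplest possible case, the following sketch works in one dimension, assuming the dilation/erosion form $\varepsilon^{-1}\big(\mu_0(A^\varepsilon \setminus A) + \mu_1(A \setminus A^{-\varepsilon})\big)$ for the nonlocal functional, the half-line classifier $A = (-\infty, 0]$, and Gaussian class densities; all of these concrete choices are assumptions made for this illustration. The limit should be the weighted perimeter, here just $(\rho_0 + \rho_1)(0)$ since the decision boundary is a single point.

```python
import math

def Phi(t):
    # Standard normal CDF.
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def phi(t):
    # Standard normal density.
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

# Hypothetical class densities: rho_0 = 0.5 * N(1,1), rho_1 = 0.5 * N(-1,1),
# and the classifier A = (-inf, 0], whose decision boundary is the point 0.
def per_eps(eps):
    # (1/eps) * [ mu_0(A^eps \ A) + mu_1(A \ A^-eps) ],
    # with A^eps \ A = (0, eps] and A \ A^-eps = (-eps, 0].
    mass0 = 0.5 * (Phi(eps - 1.0) - Phi(-1.0))  # rho_0-mass of (0, eps]
    mass1 = 0.5 * (Phi(1.0) - Phi(1.0 - eps))   # rho_1-mass of (-eps, 0]
    return (mass0 + mass1) / eps

# Weighted perimeter of A: (rho_0 + rho_1) evaluated at the boundary point.
weighted_perimeter = 0.5 * phi(-1.0) + 0.5 * phi(1.0)
values = [per_eps(e) for e in (0.5, 0.1, 0.01)]
```

As $\varepsilon$ shrinks, the rescaled boundary-layer masses approach the weighted perimeter, matching the heuristic in the text.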

One first possible way to relate $\mathrm{Per}_\varepsilon$ and $\mathrm{Per}$ is through a *pointwise* convergence analysis: fix a set $A$ with regular enough boundary and then study the behavior of $\mathrm{Per}_\varepsilon(A)$ as $\varepsilon \to 0$. This is the type of analysis discussed in [GTM22], which was used by the authors to motivate the connection between adversarial robustness in binary classification and geometric variational problems involving perimeter. However, pointwise convergence of $\mathrm{Per}_\varepsilon$ toward $\mathrm{Per}$ is not sufficient to ensure that $\mathrm{Per}_\varepsilon$ induces a perimeter-regularization-type effect on its minimizers. For that, we need a different type of convergence.

A different way to compare the functionals $\mathrm{Per}_\varepsilon$ and $\mathrm{Per}$ is through a *variational* analysis; we refer the interested reader to the recent paper [BS22], which explains in detail this type of convergence and shows that $\mathrm{Per}_\varepsilon$ converges variationally toward $\mathrm{Per}$, at least when balls are induced by the Euclidean distance. Here we restrict ourselves to discussing some of the mathematical implications of the analysis in [BS22], which considers problem (8) when $\mathfrak{A}$ is the set of all (Borel) measurable subsets of $\mathcal{X}$; this setting corresponds to an *agnostic* learner setting.

First, [BS22] shows that minimizers of (8) converge, as $\varepsilon \to 0$, toward minimizers of the problem

$$\min_{A \in \operatorname{argmin} R} \; \mathrm{Per}(A), \tag{10}$$

where $R(A) := \mu_0(A) + \mu_1(A^c)$ denotes the unrobust risk. This means that, as $\varepsilon \to 0$, solutions to the adversarial training problem select among minimizers of the unrobust risk the ones with minimal perimeter. In particular, this result helps capture the idea that when $\varepsilon$ is small, the presence of an adversary has the same effect as imposing a perimeter penalization term on the classifiers; cf. [GTM22] for more discussion on this idea and on the relation with *mean curvature flows*.

A second consequence of the results in [BS22] is an expansion in $\varepsilon$ for the adversarial risk, i.e., the minimum value of (AT). Precisely,

$$\min_{A} \Big\{ R(A) + \varepsilon\, \mathrm{Per}_\varepsilon(A) \Big\} \;=\; \min_{A} R(A) + \varepsilon\, \mathrm{Per}^* + o(\varepsilon) \quad \text{as } \varepsilon \to 0, \tag{11}$$

where $\mathrm{Per}^*$ denotes the minimum value in (10). The above result is reminiscent of Danskin’s theorem for functions over Euclidean space, a theorem that is used to characterize the first “derivative” of a function defined as the infimum over a family of functions. Naturally, the difficulty in proving a result like (11) lies in the fact that the functionals of interest take subsets of $\mathcal{X}$ as input, and thus Danskin’s theorem cannot be applied. Formula (11) can be used to give a geometric interpretation to the rate at which accuracy is lost when building robust classifiers as a function of the adversarial budget $\varepsilon$. Indeed, this result says that for small $\varepsilon$, accuracy is lost at the rate given by the Bayes classifier with minimal perimeter. The trade-off between accuracy and robustness has been investigated from statistical perspectives in [ZYJ19], for example. In contrast, the tools and concepts discussed here have a geometric and analytic flavor, with the caveat that they are only meaningful for a population-level analysis of adversarial robustness in binary classification. To gain an even better understanding of the situation, higher-order expansions of the adversarial risk in $\varepsilon$ would be desirable and are a current topic of investigation.

### 2.2. Certifiability and regularity

In this section, we discuss notions of certifiability and regularity of robust binary classifiers. We begin with a definition.

In simple terms, the certifiable points of a given classifier are the points in $\mathcal{X}$ whose classification stays constant within the ball of radius $\varepsilon$ around them: they are the points that are insensitive to the adversarial attacks in the adversarial problem (AT).

While it is not possible to build nontrivial sets for which all points in are certifiable, we can still ask whether it is possible to find robust classifiers that are fully characterized by their certifiable points. This motivates the following definitions.

Notice that a classifier that is pseudo-certifiable is completely determined by its outputs on its certifiable points. Pseudo-certifiability is thus a desirable property. It is then natural to wonder whether it is always possible to construct a pseudo-certifiable classifier minimizing the adversarial risk, i.e., a set solving (AT) when the class of sets $\mathfrak{A}$ consists of all measurable subsets of $\mathcal{X}$. As it turns out, the notion of pseudo-certifiability is very strong, and in the case of the Euclidean distance, for example, it implies that decision boundaries are locally the graph of a $C^{1,1}$ function; see [BGTM23] and references therein. An example of a setting where no optimal robust classifier is pseudo-certifiable is given in Figure 3. There, $\mu$ is the sum of four delta measures in $\mathbb{R}^2$, two red and two purple. Any optimal classifier must stay constant within the ball of radius $\varepsilon$ centered at each point. The color choice in the shaded blue region does not affect optimality; however, there is no way to color this region and maintain inner-ball regularity for both sets (cf. Figure 3).

While we cannot guarantee pseudo-certifiability for robust classifiers in general, we can still guarantee existence of $\varepsilon$-inner regular solutions, $\varepsilon$-outer regular solutions, and sometimes solutions with other forms of regularity. This is the content of a series of results in [BGTM23], stated informally below.

The analysis in [BGTM23] also provides quantitative estimates for the regularity of the decision boundary of the classifier in the last part of Theorem 2.3. In general, these estimates blow up as $\varepsilon \to 0$. It is however expected that one could get finer regularity estimates under additional assumptions on $\mu$, e.g., assuming that the densities $\rho_0$ and $\rho_1$, and the set where they coincide, are sufficiently regular. Obtaining these finer estimates and characterizing the needed regularity for these finer estimates to apply are topics of current investigation.

## 3. Connections to Game Theory: DRO Formulations of AT

In this section, we introduce a framework for adversarial training that encompasses (AT) and that can be cast more precisely within game theory. In particular, in this larger framework we will be able to discuss the notion of Nash equilibrium in adversarial training and consider its implications for the robust training of learning models.

The idea is as follows. Instead of considering pointwise attacks as in (AT), where for every single data point the adversary proposes an attack, we allow the adversary to modify the distribution $\mu$ by producing an entirely new data distribution $\tilde{\mu}$. Naturally, as in model (AT), the adversary must pay a price (or use a budget) for carrying out this modification. In precise terms, we consider the following families of problems:

$$\min_{f \in \mathcal{F}} \; \sup_{\tilde{\mu} : \, D(\mu, \tilde{\mu}) \le \varepsilon} \; \mathbb{E}_{(\tilde{x}, \tilde{y}) \sim \tilde{\mu}}\big[ \ell\big(f(\tilde{x}), \tilde{y}\big) \big] \tag{12}$$

and

$$\min_{f \in \mathcal{F}} \; \sup_{\tilde{\mu} \in \mathcal{P}(\mathcal{Z})} \; \Big\{ \mathbb{E}_{(\tilde{x}, \tilde{y}) \sim \tilde{\mu}}\big[ \ell\big(f(\tilde{x}), \tilde{y}\big) \big] - C(\mu, \tilde{\mu}) \Big\}. \tag{13}$$

The above are two instances of *distributionally robust optimization* problems, given their inner maximization over probability measures (here $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$ denotes the ambient data space). Notice that, in general, problem (12) can be written as (13) by defining the cost $C(\mu, \tilde{\mu})$ to be $0$ if the explicit constraint $D(\mu, \tilde{\mu}) \le \varepsilon$ is satisfied and infinity otherwise. Both $D$ and $C$ can be interpreted as “distances” between probability distributions.

In the remainder of the paper, we will restrict our attention to problem (13) with a cost function $C$ taking the form of an optimal transport problem:

$$C(\mu, \tilde{\mu}) := \inf_{\pi \in \Gamma(\mu, \tilde{\mu})} \int_{\mathcal{Z} \times \mathcal{Z}} c(z, \tilde{z})\, d\pi(z, \tilde{z}), \tag{14}$$

for a cost function $c : \mathcal{Z} \times \mathcal{Z} \to [0, \infty]$ that describes the marginal cost that the adversary must pay in order to move a clean data point $z = (x, y)$ to a new location $\tilde{z} = (\tilde{x}, \tilde{y})$ (recall that $\Gamma(\mu, \tilde{\mu})$ is the space of probability measures on $\mathcal{Z} \times \mathcal{Z}$ with first marginal $\mu$ and second marginal $\tilde{\mu}$). Notice that, in this generality, the adversary has the ability to modify both the feature vector $x$ and the label $y$.

Some natural examples of cost functions are $c(z, \tilde{z}) = \tau \|x - \tilde{x}\|$ if $y = \tilde{y}$ and $c(z, \tilde{z}) = \infty$ otherwise, a choice that is particularly meaningful when $\mathcal{X}$ is a normed space; here, $\tau$ is a positive constant that can be interpreted as reciprocal to an adversarial budget. Another example of cost function of interest is (15) below, which can be used to rewrite problem (AT) in the form (13). Problem (13) is thus a rather general mathematical formulation for adversarial training.
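As a small numerical companion to the coupling-based cost just introduced, the sketch below computes the optimal transport cost between two uniform discrete measures by brute force over couplings supported on permutations; the squared-Euclidean marginal cost (ignoring labels) and the data are hypothetical choices for illustration only.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)

# Two empirical measures with 4 points each and uniform weights: the clean
# distribution mu and an adversarial proposal mu~ obtained by jitter.
clean = rng.normal(0.0, 1.0, (4, 2))
attacked = clean + rng.normal(0.0, 0.3, (4, 2))

# Marginal cost c(z, z~) = ||x - x~||^2 between every pair of support
# points (labels are ignored here for simplicity).
C = np.sum((clean[:, None, :] - attacked[None, :, :]) ** 2, axis=2)

# For uniform weights on equally many points, an optimal coupling can be
# taken to be a permutation (Birkhoff-von Neumann), so brute-force over 4!.
n = len(clean)
ot_cost = min(
    C[range(n), list(perm)].mean() for perm in itertools.permutations(range(n))
)
```

The optimal coupling can only improve on the "identity" matching of each clean point to its own jittered copy, which gives a simple sanity check on the computed value.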