
# Stochastic Separation Theorems: How Geometry May Help to Correct AI Errors

Communicated by *Notices* Associate Editor Emilie Purvine

## 1. Introduction

Recent years have seen explosive progress in data-driven artificial intelligence (AI) systems. Many decades of the development of mathematics underpinning statistical learning theory, coupled with advancements in approximation theory, numerical analysis, technology, and computing, gave rise to new-generation AI transforming our lives. These systems show great promise in cancer diagnostics MSG20, they are a part of autonomous cars 22, automated face recognition and biometrics KE21, image segmentation SBKV20, and language processing and translation tools DZS22, and as such become our new reality. Availability of unprecedented volumes of data and citizens’ expectations and participation are further driving this change.

New reality, however, brings new challenges. Uncertainties and biases are inherent within any empirical data. They enter production pipelines of data-driven AI and ripple through them causing errors. AI instabilities and adversarial examples—errors due to minor changes in data or structure—have recently been found in many advanced data-driven AI models. Moreover, mounting evidence suggests that these errors are in fact expected in such systems THG20 and may not always be cured by larger volumes of data or better training algorithms BHV21 as long as the AI architecture remains fixed.

This leads to the following question: if errors are inevitable in data-driven AI then how do we deal with them once they occur?

One way to address this imminent challenge is to equip an AI with an “error filter” or “error corrector” GT18. The function of the AI corrector is to learn from errors “on-the-job,” supplementing the AI’s initial training. Dynamic addition of AI correctors continuously extends the AI architecture, adapts to data uncertainty GGG18, and enables AI to escape the stability barrier revealed in BHV21. When a new input arrives, the AI error corrector decides whether it is likely to cause an error, and if so, reports it. To do this, the filter uses some set of attributes, such as, for example, internal latent representations of the input in the AI decision space. To each attribute the system assigns some weight $w_i$. For each new input, the system computes the numerical values $x_1, \dots, x_n$ of all attributes and compares the weighted sum $\sum_i w_i x_i$ with some threshold $c$ to decide whether to report the input as an error.

However, how does the filter determine the weights of all attributes? To do this, the filter is provided with a training set of example inputs marked as correct and as errors. Then the system tries to find weights such that, ideally, all data in the training set are classified correctly. Moreover, the system tries to ensure that all (or a large proportion of) future “unseen” inputs would be processed correctly too. In other words, the system seeks to learn the weights from some training data, and the error filter itself is therefore an example of a machine learning (ML) system.

Geometrically, any input is described by the values of the attributes, and can therefore be represented as a point in $n$-dimensional Euclidean space, where $n$ is the number of attributes. Then the criterion for an input being an error defines a half-space, whose boundary is the hyperplane defined by the equation $\sum_i w_i x_i = c$. If we mark points corresponding to errors and correct AI behavior as red and blue, respectively, the machine learning task of error identification reduces to finding a hyperplane that separates the red points from the blue ones; see Figure 1.

Assume that such a hyperplane exists and we have started to use the filter with the corresponding weights $w_i$. Imagine, however, that a new input has arrived which the filter classified as correct but the user marked as an error. In other words, the filter itself made an error. Of course, we would like the system to be able to learn from such errors and improve its performance in the future. An obvious way to do this is to add the new point to the training set and recompute the weights. This constitutes “retraining the system.” Geometrically, this means that a new red point appears on the “wrong” side of the hyperplane, so we try to find a different hyperplane that separates all points correctly; see Figure 2. Obviously, it is not always possible to find such a hyperplane; see Figure 3. Moreover, even if it is possible, it may require substantial time to recompute all weights every time the filter makes an error.

Alternatively, one may use the following error-correction method, suggested in GMT19: separate the new red point from the existing blue points by another hyperplane, given by an equation $\sum_i v_i x_i = d$; see Figure 3. After this, classify any new input as an error if either $\sum_i w_i x_i > c$ or $\sum_i v_i x_i > d$.
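In code, such a filter and its one-shot corrector are only a few lines. The sketch below is our own illustration (the function names and the toy weights are hypothetical, not from the article): the corrector leaves the original rule untouched and adds a second linear rule on top of it.

```python
def linear_filter(w, c, x):
    """Flag input x as a (suspected) error when the weighted sum
    of its attribute values exceeds the threshold c."""
    return sum(wi * xi for wi, xi in zip(w, x)) > c

def corrected_filter(w, c, v, d, x):
    """Two-hyperplane corrector: flag x as an error if either the
    original rule or the added corrector rule fires."""
    return linear_filter(w, c, x) or linear_filter(v, d, x)

# Toy example in the plane: the original filter only checks the
# first attribute; the added rule catches a newly reported error
# that the original rule misses.
w, c = [1.0, 0.0], 0.5   # original weights and threshold
v, d = [0.0, 1.0], 0.5   # corrector weights and threshold
```

Retraining would recompute `w` and `c` from scratch; the corrector instead keeps them and merely appends the pair `(v, d)`, which is why it is cheap to apply every time a new error is reported.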

A careful reader may have already noticed a limitation of this approach that appears to be fundamental: why did we assume that a point can be separated from all other points by a hyperplane? Obviously, if that point belongs to the convex hull^1 of the other points, then a separating hyperplane does not exist and the method does not work; see Figure 4. For example, even if we have just three points $x_1$, $x_2$, $x_3$, then $x_3$ may lie in the interior of the line interval with endpoints $x_1$ and $x_2$, and in this case it cannot be separated from $x_1$ and $x_2$.

However, intuitively, the described case is in some sense “degenerate” and should not happen too often with real data. The best way to formalise this intuition is to use the language of probability theory, and ask what is the probability that the method would work for random data. This leads to a very nice problem that lies on the borderline of probability theory and geometry.

## 2. Sylvester’s Problem

Problem 1 has a long history and goes back at least to the question asked by Sylvester in 1864: given four random points on the plane, what is the probability that they form a convex quadrilateral?

To address this question, it is convenient to introduce the following random variables. Let $\xi_1$ be the random variable equal to $1$ if point $x_1$ is inside the triangle $x_2 x_3 x_4$ and $0$ otherwise. Let random variables $\xi_2$, $\xi_3$, and $\xi_4$ be defined similarly. Then the random variable
$$\xi = \xi_1 + \xi_2 + \xi_3 + \xi_4$$

counts the number of points that are inside the triangle formed by the other three points. Hence $\xi = 0$ precisely if $x_1, x_2, x_3, x_4$ form a convex quadrilateral, and this happens with probability $p$. With probability $1-p$, $\xi = 1$. Thus, the expected value is $\mathbb{E}[\xi] = 0 \cdot p + 1 \cdot (1-p) = 1-p$, and $p = 1 - \mathbb{E}[\xi]$.

This implies that to find $p$ it suffices to find $\mathbb{E}[\xi]$. From the linearity of the expectation, and assuming that $x_1, x_2, x_3, x_4$ are drawn independently from the same distribution,
$$\mathbb{E}[\xi] = \mathbb{E}[\xi_1] + \mathbb{E}[\xi_2] + \mathbb{E}[\xi_3] + \mathbb{E}[\xi_4] = 4\,\mathbb{E}[\xi_1].$$

Next, $\xi_1$ is a random variable that takes values $0$ or $1$, and
$$\mathbb{E}[\xi_1] = \Pr[\xi_1 = 1],$$
where $\Pr[\xi_1 = 1]$ is the probability that $x_1$ lies inside the triangle $x_2 x_3 x_4$.

If the points are selected independently and uniformly at random from the unit disk $D$, then, by the law of total expectation,
$$\Pr[\xi_1 = 1] = \frac{\mathbb{E}[S(x_2 x_3 x_4)]}{S(D)},$$

where $S(\cdot)$ denotes the area. Hence, the problem reduces to determining the expected area of the triangle $x_2 x_3 x_4$. In 1867, Woolhouse determined that
$$\mathbb{E}[S(x_2 x_3 x_4)] = \frac{35}{48\pi},$$

hence
$$p = 1 - \mathbb{E}[\xi] = 1 - \frac{4}{\pi}\cdot\frac{35}{48\pi} = 1 - \frac{35}{12\pi^2} \approx 0.7045.$$
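Sylvester's constant is easy to confirm numerically. The following self-contained Monte Carlo sketch (our own illustration, not from the article) estimates the probability that four uniform random points in a disk are in convex position and compares it with $1 - 35/(12\pi^2) \approx 0.7045$.

```python
import math
import random

def random_point_in_disk(rng):
    # Rejection sampling: uniform in the bounding square until inside the disk.
    while True:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            return (x, y)

def is_convex_quadrilateral(pts):
    # Four points are in convex position iff none of them lies inside
    # the triangle formed by the other three.
    def cross(o, u, v):
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])
    def in_triangle(p, a, b, c):
        s1, s2, s3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
        return (s1 >= 0 and s2 >= 0 and s3 >= 0) or (s1 <= 0 and s2 <= 0 and s3 <= 0)
    return not any(
        in_triangle(pts[i], *[pts[j] for j in range(4) if j != i])
        for i in range(4)
    )

rng = random.Random(0)
trials = 30_000
hits = sum(
    is_convex_quadrilateral([random_point_in_disk(rng) for _ in range(4)])
    for _ in range(trials)
)
estimate = hits / trials
exact = 1 - 35 / (12 * math.pi ** 2)   # Sylvester's constant for the disk
```

With 30,000 trials the standard error is below 0.003, so the estimate should land within about 0.01 of the exact value.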

Of course, random points can be selected inside regions different from a disk. Sylvester also asked the same question in the following modified form. Let $K$ be a convex body in the plane (that is, a compact convex set with non-empty interior) and choose four points from $K$ independently and uniformly at random. What is the probability $p(K)$ that these points are the vertices of a convex quadrilateral? Further, for what $K$ is this probability the smallest and the largest? The second question was solved by Blaschke, who proved in 1917 that for all convex bodies $K$,
$$p(T) \le p(K) \le p(D),$$

where $T$ and $D$ denote a triangle and a disk in the plane, respectively.

Sylvester’s question can be asked for $n$ points: if they are selected uniformly at random in a convex body $K$ in the plane, what is the probability $p_n(K)$ that they form a convex $n$-gon?

In 1995, Valtr solved this problem exactly for a parallelogram $P$ and proved that
$$p_n(P) = \left(\frac{1}{n!}\binom{2n-2}{n-1}\right)^{2}.$$
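Valtr's formula can likewise be checked by simulation. For $n = 4$ it gives the classical value $p_4(P) = (20/24)^2 = 25/36 \approx 0.694$. The sketch below (our own illustration) estimates the convex-position probability in the unit square by counting convex hull vertices.

```python
import random

def convex_hull_size(points):
    # Andrew's monotone chain: returns the number of vertices of the
    # convex hull (collinear boundary points are not counted).
    pts = sorted(points)
    def half(seq):
        h = []
        for p in seq:
            while len(h) >= 2 and (
                (h[-1][0] - h[-2][0]) * (p[1] - h[-2][1])
                - (h[-1][1] - h[-2][1]) * (p[0] - h[-2][0]) <= 0
            ):
                h.pop()
            h.append(p)
        return h
    lower = half(pts)
    upper = half(list(reversed(pts)))
    return len(lower) + len(upper) - 2   # endpoints are shared

rng = random.Random(1)
trials = 30_000
n = 4
hits = sum(
    convex_hull_size([(rng.random(), rng.random()) for _ in range(n)]) == n
    for _ in range(trials)
)
estimate = hits / trials   # should be close to 25/36 for n = 4
```

Changing `n` lets one watch the probabilities decay, in line with the asymptotics discussed below.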

In 1996, Valtr also solved this problem for a triangle $T$ and showed that
$$p_n(T) = \frac{2^n\,(3n-3)!}{((n-1)!)^3\,(2n)!}.$$

Using Stirling’s approximation for the factorial, it is straightforward to prove that both $\sqrt[n]{p_n(P)}$ and $\sqrt[n]{p_n(T)}$ are of order $1/n^2$.

Because any convex body $K$ in the plane can be sandwiched between two triangles, this implies the existence of universal constants $0 < c_1 < c_2$ such that
$$\left(\frac{c_1}{n^2}\right)^{n} \le p_n(K) \le \left(\frac{c_2}{n^2}\right)^{n}$$

for all $n$ and all $K$. In fact, Bárány Bár99 proved in 1999 that
$$\lim_{n\to\infty} n^2 \sqrt[n]{p_n(K)} = C(K)$$

for some constant $C(K)$ that depends on $K$. For example, $C = 16e^2$ for a parallelogram, $C = \frac{27}{2}e^2$ for a triangle, and $C = 2\pi^2 e^2$ for a disk. In particular, $p_n(K)$ approaches $0$ as $n \to \infty$ with super-exponential speed.

Can we have exact (non-asymptotic) formulas for $p_n(D)$? In 1971, Miles derived the exact formula for $p_5(D)$.

Finally, Marckert in 2017 derived exact (but somewhat complicated) formulas for $p_n(D)$ for an arbitrary $n$.

The following table lists numerical values of $p_n(P)$, $p_n(T)$, and $p_n(D)$ for small $n$.

As expected, we see that the probabilities decrease fast even for small values of $n$. This is bad news for our machine learning application, because it shows that new points will most likely be in the convex hull of other points. However, *all these results are in the plane*, which corresponds to a (toy) machine learning system with just two attributes. Any real ML system has significantly more attributes, hence we should study Problem 1 in higher-dimensional spaces. In the next section we show that separability properties of random points in higher dimensions are dramatically different from those computed for our low-dimensional example.

## 3. The Effect of Higher Dimension

We start our analysis of Problem 1 in higher dimensions with a simple special case. Let $B_d$ be the closed unit ball in $\mathbb{R}^d$. We first consider the case when points $x_1, \dots, x_n$ are fixed, and $x$ is selected uniformly at random in $B_d$. In 1986, Elekes Ele86 proved that for any points $x_1, \dots, x_n$ we have
$$\Pr[x \in \mathrm{conv}(x_1, \dots, x_n)] \le \frac{n}{2^d},$$

where $\mathrm{conv}(x_1, \dots, x_n)$ denotes the convex hull of the points $x_1, \dots, x_n$.
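Elekes' argument is short enough to sketch here (our reconstruction; see Ele86 for the original). Suppose $y \in \mathrm{conv}(x_1, \dots, x_n)$, so that $y = \sum_j \lambda_j x_j$ with $\lambda_j \ge 0$ and $\sum_j \lambda_j = 1$. Then
$$\|y\|^2 = \langle y, y \rangle = \sum_j \lambda_j \langle y, x_j \rangle,$$
so at least one index $i$ satisfies $\langle y, x_i \rangle \ge \|y\|^2$. This inequality rearranges to
$$\left\| y - \tfrac{1}{2}x_i \right\|^2 = \|y\|^2 - \langle y, x_i \rangle + \tfrac{1}{4}\|x_i\|^2 \le \tfrac{1}{4}\|x_i\|^2 \le \tfrac{1}{4},$$
so $y$ lies in one of $n$ balls of radius $1/2$. The union of these balls has volume at most $n\,2^{-d}\,\mathrm{vol}(B_d)$, which yields the bound.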

Now assume that we select all the points at random.

The upper bound 3 was originally proved by Bárány and Füredi in 1988. Complementing this result, Bárány and Füredi also proved that, for $n$ independent uniformly distributed points in $B_d$, the probability of being in convex position remains close to $1$ even for $n$ exponentially large in $d$.

The following table shows the corresponding bounds in various dimensions $d$.

We see that in high dimension the situation is reversed: a very large number of random points are in convex position with high probability.

This effect is not limited to the uniform distribution in the unit ball. To state a general result, we consider a *family* of distributions, one for each dimension: for example, the uniform distribution on the interval, on the disk, on the three-dimensional ball, and so on.

of probability measures, where the $d$-th measure in the family is a distribution on $\mathbb{R}^d$.

Condition 4 says that, with probability exponentially close to $1$, a random point does not fall into any fixed set of exponentially small volume.

The theorem below states that if a family of distributions has the SmAC property, then exponentially many points are in convex position with high probability.

The SmAC condition is very general and holds for a large variety of distributions. As an illustration, consider the special case of i.i.d. data, in which the probability measures are product measures whose coordinates are independent and identically distributed. In this case the SmAC condition holds under mild assumptions on the coordinate distribution; for example, it holds whenever the coordinate density is bounded by some constant.

## 4. Computing Separating Hyperplanes

Under the SmAC condition, exponentially many random points are in convex position with high probability, hence each point can be separated from the others by some hyperplane. But how can such a hyperplane actually be computed?

If the red and blue points are linearly separable, a separating hyperplane can be found by solving a quadratic optimization problem that maximizes the margin, that is, the distance from the hyperplane to the nearest data point.

The above program is a version of the well-known maximal-margin classifier, or support vector machine. This quadratic program has $d+1$ unknowns (the weights and the threshold) and one constraint per training point, and solving it can be computationally expensive when the training set is large.

The other issue with finding separating hyperplanes through solving 7 is that this approach requires full knowledge of all points, while our motivation here is *correcting AI errors*. In this context, errors arrive one at a time, and a corrector should be built quickly from limited information, without retraining on the whole data set.
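As a concrete (if deliberately naive) point of comparison, the classical perceptron rule finds *some* separating hyperplane, not the maximal-margin one, whenever the data are linearly separable, and it processes points one at a time. The sketch below is our own illustration on synthetic data, not a method from the article.

```python
import random

def perceptron(points, labels, epochs=200):
    """Perceptron rule: on each mistake, nudge the hyperplane toward the
    misclassified point. Returns (w, b) with sign(<w, x> + b) matching
    every label, or None if no separator is found within `epochs` passes."""
    dim = len(points[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):          # y is +1 or -1
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:                         # all points separated
            return w, b
    return None

# Synthetic data: "blue" (correct) points near the origin,
# "red" (error) points near (3, 3).
rng = random.Random(2)
blue = [(rng.gauss(0, 0.4), rng.gauss(0, 0.4)) for _ in range(50)]
red = [(rng.gauss(3, 0.4), rng.gauss(3, 0.4)) for _ in range(50)]
points, labels = blue + red, [-1] * 50 + [1] * 50
result = perceptron(points, labels)
```

The perceptron still needs multiple passes over all the data, so it does not resolve the issue above; it merely illustrates what "finding a separating hyperplane" costs even in a simple online setting.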

In the next sections we show that, for appropriately high dimension $d$, separating hyperplanes can often be constructed explicitly, without solving any optimization problem.

### 4.1. One-shot separability: Fisher separability

In order to develop the intuition for Problem 2, let us return to the simplest example, when the points are selected uniformly at random from the unit ball $B_d$. In this case, the probability that a given point fails to be separable from the others can be bounded explicitly in terms of $n$ and $d$.

Now, if we have at most exponentially many points, this failure probability is small.

Remarkably, this bound is even less restrictive than 3, while the conclusion is stronger: not only are this many points in convex position with probability greater than $1-\epsilon$, but each point $x_i$ can be separated from all others by a *specific* hyperplane, tangent to the sphere of radius $\|x_i\|$ at the point $x_i$.

It turns out that this simple idea to choose the hyperplane tangent to the sphere through the point works in much greater generality.

The question is: how do we know that all other points lie on the correct side of this hyperplane?

Several interesting observations stem immediately from Proposition 1. It appears that construction of separating hyperplanes does not always require complete knowledge of the sets that are being separated. Some rough information, such as the value of the point’s inner products with the other data points, may be sufficient.

It turns out that Fisher separability for exponentially many points holds for many important families of distributions, including rotation invariant log-concave distributions and product distributions whose components have bounded support or very fast-decaying tails GGT21. At the same time, there are examples of product distributions with identical log-concave components for which this is no longer true GGT21. It is hence natural to ask if and how similar simple solutions could be derived for such distributions with “heavier” tails.
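Fisher separability is also easy to probe numerically. The sketch below (our own illustration, assuming NumPy is available) samples points uniformly from the high-dimensional unit ball and checks, for every point $x_i$, the condition $\langle x_i, x_j \rangle < \langle x_i, x_i \rangle$ for all $j \neq i$, i.e., that the hyperplane through $x_i$ orthogonal to $x_i$ separates it from all the others.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 200, 2000

# Sample n points uniformly from the unit ball in R^d:
# uniform random directions, radii distributed as U^(1/d).
g = rng.standard_normal((n, d))
directions = g / np.linalg.norm(g, axis=1, keepdims=True)
radii = rng.random(n) ** (1.0 / d)
X = directions * radii[:, None]

# Point x_i is Fisher-separable from x_j when <x_i, x_j> < <x_i, x_i>:
# the tangent hyperplane at x_i puts x_j strictly on the other side.
G = X @ X.T
diag = np.diag(G).copy()
np.fill_diagonal(G, -np.inf)          # ignore the i = j entries
separable_rows = (G < diag[:, None]).all(axis=1)
fraction = separable_rows.mean()      # fraction of points separable from all others
```

In dimension 200, the fraction comes out as 1.0 even for thousands of points, whereas repeating the experiment with, say, `d = 2` makes it collapse, matching the planar computations of Section 2.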

### 4.2. One-shot separability: general case

Now we formulate the same idea in general. Let