From Bézier to Bernstein

Posted November 2008.


Bill Casselman
University of British Columbia, Vancouver, Canada
cass at math.ubc.ca

Introduction

Bézier curves are ubiquitous in computer graphics. They were introduced implicitly into theoretical mathematics long before computers, primarily by the French mathematician Charles Hermite and the Russian mathematician Sergei Bernstein. But it was only the work of Pierre Bézier, an employee of the automobile maker Renault, and of Paul de Casteljau, of Citroën, that made these curves familiar to graphics specialists. Recently, the polynomials defined by Bernstein have again become of interest to mathematicians.

Bézier curves

The simplest Bézier curve is the straight line from one point P₀ to another P₁, with the parametric equation

B(t) = P₀ + t(P₁ - P₀) = (1-t) P₀ + t P₁

from which it follows immediately that

B(0) = P₀
B(1) = P₁ .

For t between 0 and 1 the point B(t) lies the fraction t of the way from P₀ to P₁. This is the same as the weighted average of the two points, with P₀ given weight 1-t and P₁ given weight t. When t=1/2, for example, B(t) is the point (1/2)(P₀ + P₁) halfway between P₀ and P₁.

A quadratic Bézier curve is determined by three control points P₀, P₁, and P₂. It has the parametric form

B(t) = (1-t)² P₀ + 2t(1-t) P₁ + t² P₂

When t=0 all terms but the first vanish, and when t=1 all but the last vanish. Therefore

B(0) = P₀
B(1) = P₂.

The velocity vector at t is

B'(t) = -2(1-t)P₀ + 2 (1-t)P₁ - 2 t P₁ + 2t P₂ = 2 [ (1-t) (P₁-P₀) + t (P₂-P₁) ] .

When t=0 the velocity vector is 2(P₁-P₀) and when t=1 it is 2(P₂-P₁). Thus, the path starts at P₀, ends at P₂, and its tangents at P₀ and P₂ intersect at P₁.

A cubic Bézier curve is determined by four control points P₀, P₁, P₂, and P₃. It has the parametric form

B(t) = (1-t)³P₀ + 3 t(1-t)²P₁ + 3t²(1-t) P₂ + t³P₃

Again, B(0) = P₀ and B(1) = P₃.

The velocity vector at t is

B'(t) = 3 [ (1-t)²(P₁-P₀) + 2t(1-t)(P₂-P₁) + t²(P₃-P₂) ].

When t=0 the velocity vector is 3(P₁-P₀) and when t=1 it is 3(P₃-P₂). Thus, the path starts at P₀, ends at P₃; when it leaves P₀ it is heading towards P₁, and when it arrives at P₃ it is coming from the direction of P₂. Otherwise, the relationship between the path and the control points is intuitively weak.

In all these cases, the coefficients of the points P_i in the parametric equation are terms which appear in the binomial expansion, respectively

(1-t), t
(1-t)², 2t(1-t), t²
(1-t)³, 3t(1-t)², 3t²(1-t), t³ .

For t between 0 and 1 these are non-negative, and by the binomial theorem the sum is [ t + (1-t) ]ⁿ = 1ⁿ = 1 for n=1, 2, 3. What this means is that each point B(t) is a weighted average of the control points, hence lies inside the convex hull of those points.
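The weighted-average description can be tried out numerically. The following is a minimal Python sketch, not taken from any particular graphics library: the function names `bernstein_weights` and `bezier_point` and the sample control points are assumptions made for illustration.

```python
from math import comb

def bernstein_weights(n, t):
    """Return the n+1 weights C(n,k) (1-t)^(n-k) t^k for k = 0..n."""
    return [comb(n, k) * (1 - t) ** (n - k) * t ** k for k in range(n + 1)]

def bezier_point(points, t):
    """Evaluate the Bezier curve with the given 2D control points at t."""
    n = len(points) - 1
    w = bernstein_weights(n, t)
    x = sum(wk * p[0] for wk, p in zip(w, points))
    y = sum(wk * p[1] for wk, p in zip(w, points))
    return (x, y)

# Hypothetical control points for a cubic curve.
control = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
weights = bernstein_weights(3, 0.25)   # non-negative, and they sum to 1
midpoint = bezier_point(control, 0.5)  # a weighted average of the control points
```

Because the weights are non-negative and sum to 1, every point returned by `bezier_point` for t between 0 and 1 is a convex combination of the control points.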

A Bézier curve of one degree can be reproduced by one of higher degree. For example, the quadratic Bézier curve with control points P₀, P₁, P₂ is the same as the cubic Bézier curve with control points

Q₀ = P₀
Q₁ = (1/3)P₀ + (2/3)P₁
Q₂ = (2/3)P₁ + (1/3)P₂
Q₃ = P₂

This can be checked by a simple algebraic calculation.
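For readers who prefer a numerical check to the algebra, here is a small Python sketch that compares the quadratic curve with the cubic one at a few parameter values. The helper names and the sample control points are hypothetical.

```python
def quadratic(P, t):
    """Evaluate the quadratic Bezier curve with control points P at t."""
    w = ((1 - t) ** 2, 2 * t * (1 - t), t ** 2)
    return (sum(wi * p[0] for wi, p in zip(w, P)),
            sum(wi * p[1] for wi, p in zip(w, P)))

def cubic(Q, t):
    """Evaluate the cubic Bezier curve with control points Q at t."""
    w = ((1 - t) ** 3, 3 * t * (1 - t) ** 2, 3 * t ** 2 * (1 - t), t ** 3)
    return (sum(wi * q[0] for wi, q in zip(w, Q)),
            sum(wi * q[1] for wi, q in zip(w, Q)))

def elevate(P):
    """Cubic control points Q0..Q3 reproducing the quadratic P0, P1, P2."""
    P0, P1, P2 = P
    Q1 = (P0[0] / 3 + 2 * P1[0] / 3, P0[1] / 3 + 2 * P1[1] / 3)
    Q2 = (2 * P1[0] / 3 + P2[0] / 3, 2 * P1[1] / 3 + P2[1] / 3)
    return [P0, Q1, Q2, P2]

P = [(0.0, 0.0), (1.0, 3.0), (4.0, 1.0)]   # hypothetical control points
Q = elevate(P)

# Largest coordinate difference between the two curves at t = 0, 0.1, ..., 1.
max_gap = max(
    max(abs(a - b) for a, b in zip(quadratic(P, k / 10), cubic(Q, k / 10)))
    for k in range(11)
)
```

The value of `max_gap` should differ from zero only by floating point rounding.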

Calculus and Bézier curves

Suppose B(t) to be a cubic Bézier curve. We know that its coordinates are functions of t, and that position and velocity at endpoints are determined by the equations:

B(0) = P₀
B'(0) = 3(P₁-P₀)
B'(1) = 3(P₃-P₂)
B(1) = P₃ .

These conditions determine B(t) uniquely, because of a simple result first observed by the French mathematician Charles Hermite. The conditions above are really conditions on the coordinates of B(t), so the claim follows from this:

THEOREM.

Given numbers y₀, v₀, y₁, v₁, there exists a unique polynomial P(x) of degree at most 3 such that

P(0) = y₀
P'(0) = v₀
P(1) = y₁
P'(1) = v₁ .

For existence, we can guess a formula for P, written here in the variable t, from that of Bézier cubic curves:

P(t) = (1-t)³y₀ + 3(1-t)²t [y₀ + (1/3)v₀] + 3(1-t)t² [ y₁ - (1/3)v₁] + t³ y₁ .

As for uniqueness, if we have two such polynomials, let D(x) = a + bx + cx² + dx³ be their difference. Then

D(0) = a = 0
D'(0) = b = 0
D'(1) = b + 2c + 3d = 0
D(1) = a + b + c + d = 0 .

These tell us immediately that a=b=0, so the four equations reduce to two:

2c + 3d = 0
c + d = 0 ,

which have only the solutions c=d=0. Q.E.D.
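The existence half of the Theorem is easy to test numerically. The sketch below is an illustration rather than a definitive implementation: it builds the cubic in Bézier form from y₀, v₀, y₁, v₁ and returns the polynomial together with its derivative. The function name `hermite_cubic` and the sample numbers are assumptions.

```python
def hermite_cubic(y0, v0, y1, v1):
    """Return P and P' for the cubic with P(0)=y0, P'(0)=v0, P(1)=y1, P'(1)=v1."""
    # Control values of the cubic written in Bezier form.
    c = [y0, y0 + v0 / 3, y1 - v1 / 3, y1]

    def P(t):
        s = 1 - t
        return s**3 * c[0] + 3 * s**2 * t * c[1] + 3 * s * t**2 * c[2] + t**3 * c[3]

    def dP(t):
        s = 1 - t
        # Derivative of a cubic in Bezier form: 3 times a quadratic in the differences.
        return 3 * (s**2 * (c[1] - c[0]) + 2 * s * t * (c[2] - c[1]) + t**2 * (c[3] - c[2]))

    return P, dP

# Hypothetical data: value 1 and slope 2 at x=0, value -1 and slope 0.5 at x=1.
P, dP = hermite_cubic(1.0, 2.0, -1.0, 0.5)
```

Evaluating `P(0)`, `dP(0)`, `P(1)`, `dP(1)` should recover the four prescribed numbers up to rounding.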

This is a special case of the main result of Hermite, according to which we may find a unique polynomial of degree at most n-1 satisfying a total of n conditions on values of the low order derivatives at possibly different points. This is called Hermite interpolation.

The Theorem suggests a converse process. Suppose we are given a formula for a function f(x), as well as a formula for its derivative f'(x), and suppose we want to graph the function in a range a ≤ x ≤ b. We could just draw a large number of very small line segments, but we can get a smoother curve by using cubic Bézier curves. Given a moderately small range [c, d], it seems reasonable to approximate the function over that range by a cubic polynomial P(x) that satisfies these conditions:

P(c) = f(c)
P'(c) = f'(c)
P'(d) = f'(d)
P(d) = f(d) .

To see how Hermite's result applies, replace P varying over the range [c,d] by Q(t) where t varies over [0,1], and with

Q(t) = P((1-t)c + td), P(x) = Q((x-c)/(d-c)) .

To this Q we can apply Hermite's formula, keeping in mind that Q'(t) = (d-c) P'(x) by the chain rule. A little algebra will then show that the graph between c and d will be approximated by the Bézier cubic with control points

(c, f(c))
(c + h/3, f(c) + (h/3)f'(c))
(d - h/3, f(d) - (h/3)f'(d))
(d, f(d))

where h=d-c.
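As an illustration, here is a short Python sketch that computes these four control points for a given f and f' and measures how far the resulting cubic is from the true graph at its midpoint. The choice of f = sin on [0, 0.5] and the function names are assumptions made for this example.

```python
import math

def graph_control_points(f, df, c, d):
    """Control points of the Bezier cubic approximating y = f(x) on [c, d]."""
    h = d - c
    return [
        (c, f(c)),
        (c + h / 3, f(c) + (h / 3) * df(c)),
        (d - h / 3, f(d) - (h / 3) * df(d)),
        (d, f(d)),
    ]

def cubic_point(ctrl, t):
    """Evaluate a cubic Bezier curve with 2D control points at t."""
    s = 1 - t
    w = (s**3, 3 * s**2 * t, 3 * s * t**2, t**3)
    return (sum(wi * p[0] for wi, p in zip(w, ctrl)),
            sum(wi * p[1] for wi, p in zip(w, ctrl)))

# Hypothetical example: f = sin, f' = cos, on the range [0, 0.5].
ctrl = graph_control_points(math.sin, math.cos, 0.0, 0.5)
x_mid, y_mid = cubic_point(ctrl, 0.5)
error_mid = abs(y_mid - math.sin(x_mid))
```

Since the x-coordinates of the control points are equally spaced, the x-coordinate of the curve is simply c + th, so the curve really is the graph of the cubic P.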

There is one situation where this technique is definitely a good idea. Suppose that f(x) is defined as an integral of a function g(x), so that f'(x) = g(x),

where we do not have a formula for the indefinite integral of g(x). For example, we might have

Then we must approximate the integral numerically, say by Simpson's rule, so we have a recursive estimate

f(x+h) = f(x) + (h/6) ( g(x) + 4 g(x + h/2) + g(x+h) )

and, by the Fundamental Theorem of Calculus, the graph between x and x+h for small h is approximated by the Bézier curve with control points

(x, f(x))
(x + h/3, f(x) + (h/3)g(x))
(x + 2h/3, f(x+h) - (h/3)g(x+h))
(x + h, f(x+h))

where the last is computed according to Simpson's rule.
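The recipe above can be sketched in a few lines of Python. The integrand g(x) = exp(-x²), the starting value f(0) = 0, and the names `simpson_step` and `integral_segments` are choices made only for this illustration.

```python
import math

def simpson_step(g, x, fx, h):
    """Advance f from x to x+h using Simpson's rule applied to g = f'."""
    return fx + (h / 6) * (g(x) + 4 * g(x + h / 2) + g(x + h))

def integral_segments(g, a, b, steps, f_a=0.0):
    """Return one list of four Bezier control points per step on [a, b]."""
    h = (b - a) / steps
    x, fx = a, f_a
    segments = []
    for _ in range(steps):
        fx_next = simpson_step(g, x, fx, h)
        segments.append([
            (x, fx),
            (x + h / 3, fx + (h / 3) * g(x)),
            (x + 2 * h / 3, fx_next - (h / 3) * g(x + h)),
            (x + h, fx_next),
        ])
        x, fx = x + h, fx_next
    return segments

# Illustrative integrand chosen for this sketch: g(x) = exp(-x^2), with f(0) = 0.
segs = integral_segments(lambda x: math.exp(-x * x), 0.0, 1.0, 8)
end_x, end_y = segs[-1][3]
```

Each entry of `segs` can be handed directly to any drawing routine that accepts cubic Bézier segments, and consecutive segments join with matching position and slope.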

The same idea can be nicely used to plot the trajectories of systems of differential equations in the plane, using a numerical approximation to go from f(x) to f(x+h), such as one of the Runge-Kutta formulas.

Why Bézier curves?

There is something at first a bit odd about cubic Bézier curves, the fact that the curves don't actually go through the control points. So let's imagine an alternate way to draw curves that go nicely from one point P₀ to another point P₃, with a cubic parametrization P(t). The only technique that suggests itself naturally is choosing two points P₁ and P₂, and having the curve pass through them, not just near them as Bézier curves do. This means also choosing parameter values t₁ and t₂ with P(t₁) = P₁ and P(t₂) = P₂. The first choice that comes to mind is t₁ = 1/3, t₂=2/3. But it is well known (and explained clearly in the book by Henrici referred to later) that the curves chosen in this way look very odd. What turns out to be more reasonable is to choose t₁ closer to 0, t₂ closer to 1. But this means that we are very close to fixing values of the velocity at 0 and 1, which is essentially what Bézier curves do.

The deciding factor in choosing Bézier curves is a recursive property that makes them extremely efficient to compute, and which underlies an algorithm attributed to de Casteljau that draws them very quickly. Suppose we are given a quadratic Bézier curve with control points P₀, P₁, P₂. Let's divide it into two halves at the point P₀₁₂ = B(1/2). Let P₀₁ be the point midway between P₀ and P₁, and P₁₂ the point midway between P₁ and P₂. In these circumstances: (1) the point P₀₁₂ is the point midway between P₀₁ and P₁₂; (2) the half curve between P₀ and the midpoint P₀₁₂ is again a quadratic Bézier curve, with control points P₀, P₀₁, and P₀₁₂. Similarly, the second half is a quadratic Bézier curve with control points P₀₁₂, P₁₂, and P₂.

Therefore, in order to draw the curve the computer can keep calculating midpoints - very easy to do on modern binary machines since it involves division by 2 - until it has broken the curve into a number of small segments that are essentially straight, and then it can draw all those as straight line segments. Something very similar happens for cubic Bézier curves, too - each half of a cubic Bézier curve is a Bézier curve with easily calculated control points, as the neighboring figure illustrates.
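Here is a minimal Python sketch of the midpoint construction for cubic curves. It is a simplification: it halves the curve a fixed number of times rather than testing whether each piece is already essentially straight, and the function names and control points are hypothetical.

```python
def midpoint(p, q):
    """Point midway between p and q."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

def split_cubic(p0, p1, p2, p3):
    """Split a cubic Bezier curve at t = 1/2 using only midpoints."""
    p01, p12, p23 = midpoint(p0, p1), midpoint(p1, p2), midpoint(p2, p3)
    p012, p123 = midpoint(p01, p12), midpoint(p12, p23)
    p0123 = midpoint(p012, p123)          # this is the point B(1/2)
    return (p0, p01, p012, p0123), (p0123, p123, p23, p3)

def flatten(curve, depth):
    """Return a polyline approximating the curve after `depth` halvings."""
    if depth == 0:
        return [curve[0], curve[3]]
    left, right = split_cubic(*curve)
    # Drop the last point of the left half; it repeats as the first of the right.
    return flatten(left, depth - 1)[:-1] + flatten(right, depth - 1)

curve = ((0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0))   # hypothetical
polyline = flatten(curve, 4)   # 16 straight segments, 17 points
```

A production renderer would normally stop subdividing a piece as soon as its control points are nearly collinear, instead of using a fixed depth.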

Bézier curves and fonts

One of the most common uses of Bézier curves is in the design of fonts. Cubic Bézier curves are used in Type 1 fonts, and quadratic Bézier curves are used in TrueType fonts. Cubic Bézier curves are also used in the TeX fonts designed by Donald Knuth, and one of the clearest explanations is in his book MetaFont: the Program.

Bernstein polynomials

The formulas for the coordinates of Bézier curves

B₁(t) = (1-t) x₀ + t x₁
B₂(t) = (1-t)² x₀ + 2 (1-t)t x₁ + t² x₂
B₃(t) = (1-t)³ x₀ + 3 (1-t)²t x₁ + 3(1-t)t² x₂ + t³ x₃

are special cases of a more general formula for polynomials of all degrees

B_n,x(t) = (1-t)ⁿ x₀ + n (1-t)^{n-1} t x₁ + [n(n-1)/2] (1-t)^{n-2} t² x₂ + ... + C_n,k (1-t)^{n-k} t^k x_k + ... + tⁿ x_n

where C_n,k is the binomial coefficient n!/(k!(n-k)!). Thus B_n,x(0) = x₀ and B_n,x(1) = x_n. Here x is the array (x_i).

These polynomials were first defined by the Russian mathematician Sergei Bernstein around 1910. The weights multiplying the x_k in a Bernstein polynomial of degree n are the terms in the binomial expansion of [(1-t) + t]ⁿ. For t between 0 and 1 they are all non-negative, and their sum is 1. The value of B_n,x(t) is therefore a weighted average of the numbers x_i in the array x. The term C_n,k (1-t)^{n-k} t^k is also the probability of k successes in n trials in which the probability of a success is t. This is not an unimportant fact, as we shall see. Bernstein polynomials have a number of fundamental properties that we have seen already for Bézier curves:

The derivative of B_n,x(t) is B'_n,x(t) = n B_{n-1,dx}(t) where dx is the array of differences (x_{i+1} - x_i).

Thus both B'_n,x(0) and B'_n,x(1) have simple expressions. Otherwise, the exact relationship between the polynomial and its control values is rather obscure. One other useful feature is that

The graph of y = B_n,x(t) for t between 0 and 1 lies inside the convex hull of the points (k/n, x_k).

A related fact, which is crucial in the application of Bernstein polynomials to optimization problems:

The maximum and minimum values of B_n,x(t) in [0,1] are bounded by the maximum and minimum values of its coefficients.
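Both the derivative formula and the bound on maximum and minimum values can be checked numerically. The following Python sketch does so for one array of coefficients; the array, the test point, and the function names are assumptions made for the example.

```python
from math import comb

def bernstein(x, t):
    """Evaluate B_{n,x}(t) for the coefficient array x of length n+1."""
    n = len(x) - 1
    return sum(comb(n, k) * (1 - t) ** (n - k) * t ** k * x[k] for k in range(n + 1))

def bernstein_derivative(x, t):
    """Evaluate B'_{n,x}(t) as n * B_{n-1,dx}(t), dx the array of differences."""
    n = len(x) - 1
    dx = [x[i + 1] - x[i] for i in range(n)]
    return n * bernstein(dx, t)

x = [1.0, -2.0, 0.5, 3.0, 2.0]          # hypothetical coefficients, n = 4
t, eps = 0.3, 1e-6

# Compare the formula with a central difference quotient.
numeric = (bernstein(x, t + eps) - bernstein(x, t - eps)) / (2 * eps)
exact = bernstein_derivative(x, t)

# Sample the polynomial on [0,1]; every value lies between min(x) and max(x).
values = [bernstein(x, k / 100) for k in range(101)]
```

The bound is usually far from sharp, as comparing `min(values)` and `max(values)` with `min(x)` and `max(x)` shows, but it costs nothing to compute.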

And here is the proper generalization of de Casteljau's algorithm:

The restrictions of B_n,x(t) to each half-range [0,1/2] and [1/2,1] are also Bernstein polynomials, in the sense that each of B_n,x(t/2) and B_n,x((1+t)/2) can be expressed as B_n,y(t) for some array y that is simple to calculate.
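One way to calculate the arrays y is by repeated averaging of neighbors, just as for curves. The Python sketch below is an illustration under that assumption, with hypothetical names and data. It treats the left half as the polynomial x evaluated at t/2 and the right half as x evaluated at (1+t)/2, with t running over [0,1] in both cases.

```python
from math import comb

def bernstein(x, t):
    """Evaluate B_{n,x}(t) for the coefficient array x of length n+1."""
    n = len(x) - 1
    return sum(comb(n, k) * (1 - t) ** (n - k) * t ** k * x[k] for k in range(n + 1))

def split_half(x):
    """Return coefficient arrays for the restrictions to [0,1/2] and [1/2,1]."""
    left, right = [], []
    row = list(x)
    while row:
        left.append(row[0])       # first entry of each row of averages
        right.append(row[-1])     # last entry of each row of averages
        row = [(a + b) / 2 for a, b in zip(row, row[1:])]
    return left, right[::-1]

x = [1.0, -2.0, 0.5, 3.0, 2.0]    # hypothetical coefficients, n = 4
left, right = split_half(x)

# Largest discrepancy between each half and the original polynomial.
gap = max(
    max(abs(bernstein(left, t) - bernstein(x, t / 2)),
        abs(bernstein(right, t) - bernstein(x, (1 + t) / 2)))
    for t in (0.0, 0.25, 0.5, 0.75, 1.0)
)
```

Only additions and divisions by 2 are used, which is the source of the efficiency mentioned earlier.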

These polynomials were invented by Sergei Bernstein in order to prove a fundamental result in approximation theory, as we shall see in the next section. They are impractical in this role, but in recent years they have proven important in optimization theory. See, for example, the thesis of Roland Zumkeller (linked to in the References). This is gratifying - almost anyone who has seen Bernstein's proof of Weierstrass' theorem must have felt that these polynomials were destined to play other roles, as well.

Bernstein's proof of Weierstrass' approximation theorem

In about 1885 Karl Weierstrass, at the age of 70, published a proof of one of the theorems for which he is most famous: Any continuous function on the interval [0,1] may be uniformly approximated, arbitrarily closely, by polynomials. The definition of continuous functions is relatively simple, and this definition is in close accord with intuition, so it was apparently surprising to learn, as mathematicians of the nineteenth century acquired more and more knowledge about them, that continuous functions could exhibit rather strange behavior. Weierstrass' theorem was perhaps considered refreshing evidence that they were not so wildly behaved after all.

Many outstanding mathematicians soon came up with proofs rather different from Weierstrass' original one. One of the most satisfying was that found by Sergei Bernstein, and it was apparently for this purpose that he came up with the polynomials now named after him. Unlike other versions, his provides a very explicit converging sequence of polynomials.

Suppose f is a continuous function on [0,1]. For each n let f_n be the array of n+1 values f(k/n) for k = 0, ..., n, and write B_n,f for the Bernstein polynomial with this array of coefficients. Explicitly, we have

B_n,f(t) = B_{n, f_n}(t) = f(0) (1-t)ⁿ + f(1/n) n (1-t)^{n-1} t + ... + f(k/n) C_n,k (1-t)^{n-k} t^k + ... + f(1) tⁿ

THEOREM. Suppose f(x) to be a continuous function on [0,1]. The polynomials B_n,f(x) converge uniformly on [0,1] to f(x) as n goes to infinity.

Here, for example, is Bernstein's approximation to f(x) = |x-1/2| for n=32:

and here is a table of the Bernstein approximations of |t-1/2| at t=1/2 for various values of n:

n	B_n,f(1/2)
8	0.1367
16	0.0982
32	0.0700
64	0.0497
128	0.0352
256	0.0249

As you can guess, the convergence here is of order about 1/√n around t=1/2, and better at other points. We can see that although one virtue of Bernstein polynomials is that they can approximate arbitrary continuous functions, on functions that one sees in practice they converge at an impractically slow rate.
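Values of B_n,f(1/2) for f(t) = |t - 1/2| can be computed directly from the definition. The short Python sketch below does this for several n; the function name `bernstein_approx` is an assumption, and exact integer binomial coefficients from the standard library keep the arithmetic well behaved.

```python
from math import comb

def bernstein_approx(f, n, t):
    """Evaluate B_{n,f}(t) = sum over k of f(k/n) C(n,k) (1-t)^(n-k) t^k."""
    return sum(f(k / n) * comb(n, k) * (1 - t) ** (n - k) * t ** k
               for k in range(n + 1))

def f(t):
    return abs(t - 0.5)

# Since f(1/2) = 0, the value of the approximation at 1/2 is also its error there.
values = {n: bernstein_approx(f, n, 0.5) for n in (8, 16, 32, 64, 128, 256)}

# Ratio of successive errors when n doubles; close to 1/sqrt(2) for large n.
ratios = [values[2 * n] / values[n] for n in (8, 16, 32, 64, 128)]
```

The ratios settling near 0.707 is what convergence of order 1/√n looks like in practice.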

Bernstein's proof of convergence is very elegant. It is a variation on a simple argument used in a slightly different context, that of local averaging. Suppose f to be a continuous function on [0,1]. Define for each n a function

f_n(x) = (1/2)( f(x - 1/(2n)) + f(x + 1/(2n)) ) .

In other words, the value of f_n at x is the average of the values of f at x - 1/(2n) and x + 1/(2n), the endpoints of an interval of width 1/n. Or, to give slightly more complicated examples:

f_n(x) = (1/4)( f(x - 1/n) + 2 f(x) + f(x + 1/n) )

f_n(x) = (1/8)( f(x - 3/(2n)) + 3 f(x - 1/(2n)) + 3 f(x + 1/(2n)) + f(x + 3/(2n)) ) .

These are now different weighted averages of f near x. The value of the Bernstein polynomial B_n,f at x is also a weighted average of values of f around x, with a weighting that depends on x, although that is not at all apparent at first sight. It is a sum of values of f at the points k/n, and what happens is that only the terms for k/n near x make a sizable contribution. Here are some graphs of the weighting for various values of x, with n=100:

Why is this? The weighting function is the probability distribution of the number k of successes in n trials with a probability of success equal to x. The average value of the proportion k/n of successes - the mean value of this probability distribution - is x, and the real basis of Bernstein's reasoning is that the standard deviation of this distribution - the quantitative measure of its spread - is

σ = √( x(1-x)/n ) .

As n grows, this shrinks. What this means is that the probability distribution clusters more narrowly around the mean value x, and more or less uniformly for all x. The proof uses this observation. Since the sum of the binomial weights C_n,k (1-x)^{n-k} x^k over k is 1, we have

| f(x) - B_{n, f_n}(x) | ≤ Σ_k | f(x) - f(k/n) | C_n,k (1-x)^{n-k} x^k .

We can divide the sum over the points k/n into those which are near x and those far away, since we expect the sum over the near ones to be the major contribution. Say we choose some small number δ and sum separately over the k with |x-k/n| ≤ δ and those with |x-k/n| > δ.

The first is

Σ_{|x-k/n| ≤ δ} | f(x) - f(k/n) | C_n,k (1-x)^{n-k} x^k .

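Since f is continuous on the closed interval [0,1] it is uniformly continuous, so given ε > 0 we can choose δ small enough that | f(x) - f(k/n) | < ε/2 whenever |x - k/n| ≤ δ, whatever x is. Since the weights sum to at most 1, the first sum will then be less than ε/2.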

Now let M be the maximum spread of f on [0,1], the difference between its maximum and minimum values. The second sum is

Σ_{|x-k/n| > δ} | f(x) - f(k/n) | C_n,k (1-x)^{n-k} x^k

which is at most M times the sum

Σ_{|x-k/n| > δ} C_n,k (1-x)^{n-k} x^k .

We must now choose n so that this sum is at most ε/(2M). Being able to do so is plausible, since as n increases the range over which C_n,k (1-x)^{n-k} x^k is sizable becomes narrower. There are many ways to make this idea rigorous - the simplest is to use Chebyshev's inequality. This says that if p_i is a discrete probability distribution on values t_i with mean m and standard deviation σ then the probability of |t - m| ≥ s σ is at most 1/s². This is a very elementary fact, and we'll come back to justify it in a moment. Given it, take s = δ/σ, so that 1/s² = σ²/δ² = x(1-x)/(nδ²). We see that if n > 2M x(1-x)/(ε δ²) then the sum above is less than ε/(2M), and the second sum is therefore less than ε/2. Since x(1-x) ≤ 1/4, it suffices to take n > M/(2 ε δ²), a choice that does not depend on x, which is why the convergence is uniform.

Proof of Chebyshev's inequality. Since |t - m|/(sσ) ≥ 1 if and only if |t - m|²/(s²σ²) ≥ 1, the probability in question is the sum of p_i over all i with |t_i - m|²/(s²σ²) ≥ 1. But this is at most the sum of p_i |t_i - m|²/(s²σ²) over the same region, which in turn is at most the sum over all i of the same terms, which is (by definition of standard deviation)

Σ_i p_i ( |t_i - m|²/(s²σ²) ) = ( 1/(s²σ²) ) Σ_i p_i |t_i - m|² = 1/s² .
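To see how Chebyshev's inequality behaves for the binomial weights, one can compare the actual weight of the terms far from x with the bound x(1-x)/(nδ²). The Python sketch below does this for the sample values x = 0.3 and δ = 0.1, which are assumptions chosen only for illustration, as are the function names.

```python
from math import comb

def tail_weight(n, x, delta):
    """Sum of the binomial weights over k with |x - k/n| > delta."""
    return sum(comb(n, k) * (1 - x) ** (n - k) * x ** k
               for k in range(n + 1) if abs(x - k / n) > delta)

def chebyshev_bound(n, x, delta):
    """Chebyshev's bound sigma^2 / delta^2 with sigma^2 = x(1-x)/n."""
    return x * (1 - x) / (n * delta ** 2)

# Triples (n, actual tail weight, Chebyshev bound) for growing n.
checks = [(n, tail_weight(n, 0.3, 0.1), chebyshev_bound(n, 0.3, 0.1))
          for n in (50, 100, 200, 400)]
```

The actual tail weight is typically much smaller than the bound, which is one way of seeing that the estimate used in the proof is far from optimal even though it suffices.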

References

To find out more ...

... about Bézier curves

The first two books contain articles by Pierre Bézier on the history of his work with the curves named after him.

R. E. Barnhill and R. F. Riesenfeld (editors), Computer Aided Geometric Design, Academic Press, 1974.
This book contains papers presented at a conference at the University of Utah that initiated much of modern computer graphics.
G. Farin, Curves and surfaces for computer aided design, Academic Press, 1988.

Various works by Donald Knuth explain clearly the role of Bézier curves in font design.

D. E. Knuth, MetaFont: the Program, Addison-Wesley, 1986.
The source code for this book is in the WEB file mf.web available in CTAN distributions, and in particular at http://www.ctan.org/tex-archive/systems/knuth/dist/mf/. This can be turned into a TeX file by applying the program weave. Pages 123 - 131 explain extremely clearly the author's implementation of Bézier curves in his program MetaFont. For the admittedly rare programmer who wishes to build his own implementation (at the level of pixels), or for anyone who wants to see what attention to detail in first class work really amounts to, this is the best resource available.
MetaFont (Wikipedia)
MetaFont: the Book (TeX source)
Be sure also to download the macros necessary to TeX it.

The next two items are about interpolation of various kinds.

Charles Hermite, "Sur la formule d'interpolation de Lagrange", Journal für die reine und angewandte Mathematik, volume 84, 1877.
This is available from the Göttingen Digital Archive.
Peter Henrici, Essentials of numerical analysis, Wiley, 1982
Discusses interpolation and the related topic of splines very clearly.

... about Bernstein polynomials

Wikipedia has several useful entries involving Bernstein polynomials and interpolation.

There is a major site at Technion University concerned with approximation theory:

The entry on Sergei Bernstein
Demonstration of Weierstrass' approximation theorem ...
This is a reproduction of Bernstein's original publication in French, which is not otherwise easy to find.
Weierstrass and approximation theory
This is an historical essay by Allan Pinkus.

Bernstein polynomials are not practical for polynomial approximation, but in recent years it has become apparent that they are practical for optimization. The next item is one of the more interesting places where this is discussed.

Roland Zumkeller's web pages
Zumkeller's thesis Global optimization in type theory is available in the publications list. It explains the role of Bernstein polynomials in a project to provide provable estimates of the minimal value of a polynomial on certain domains. This is part of a larger project to produce a formal proof of Kepler's conjecture along the lines of Thomas Hales' proof. His exposition of the theory of Bernstein polynomials is novel.


Those who can access JSTOR can find some of the papers mentioned above there. For those with access, the American Mathematical Society's MathSciNet can be used to get additional bibliographic information and reviews of some of these materials. Some of the items above can be accessed via the ACM Portal, which also provides bibliographic services.