Simon Newcomb and "Natural Numbers" (Benford's Law)
Posted May 2009.
In 1881 Simon Newcomb (1835-1909), the Canadian-American astronomer and mathematician, published a "Note on the Frequency of Use of the Different Digits in Natural Numbers." For Newcomb, natural numbers were those occurring "in nature," i.e. the kind of numbers one would run into in the course of everyday life. He discovered, for example, that not all the digits (1, 2, ..., 9) occur with the same frequency in the first place of such a number; he formulated a law (see below) and gave a rough proof which I will attempt to present. This law was rediscovered by Frank Benford ("The law of anomalous numbers," 1938) and is now somewhat unfairly known as "Benford's Law." A mathematically sound and complete proof was published by Theodore Hill in 1995.
An experiment with The New York Times
To get some experimental feeling for the phenomenon, I looked at all the numbers given as numerals in the first 15 pages of a recent edition of The New York Times. I omitted dates and advertisements, and repeats (in the same context, in captions or in tables). For each of those 213 numbers I recorded the first digit, and tabulated the data as follows:
For some of the flavor of Newcomb's "natural number" concept, here are the 8 numbers from this set with initial digit 7:
||age of Jane Fonda
||illegal gambling proceeds, Wilkes-Barre, Pa.
||U.S. economic stimulus package, 2/09
||Dow Jones Industrial Average, 2/20/09
||magnitude of hypothetical earthquake
||population of San Francisco
||age of Senator Ronald W. Burris
||low-end starting salary for butler, New York City
Clearly the distribution is very unsymmetrical. Newcomb tells us how he was led to his discovery: "That the ten digits do not occur with equal frequency must be evident to anyone making much use of logarithmic tables, and noticing how much faster the first pages wear out than the last ones. The first significant figure is oftener 1 than any other digit, and the frequency diminishes up to 9." The place where he noticed the phenomenon gave him a clue to its explanation, which he formulated thus:
The law of probability of the occurrence of numbers is such that all mantissae of their logarithms are equally probable.
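In modern terms the mantissa of a number is the fractional part of its base-10 logarithm (the term is explained below). A number has first digit d exactly when its mantissa falls between log(d) and log(d+1), so if all mantissae are equally probable, the digit d should lead with probability log(d+1) - log(d) = log(1 + 1/d). The following Python sketch tabulates these values; the function name is mine, chosen for illustration, and nothing here comes from Newcomb's note beyond the law itself.

```python
import math

def first_digit_probability(d: int) -> float:
    """Probability that the leading digit is d when mantissae are uniform on [0, 1)."""
    # First digit d means the mantissa lies in [log10(d), log10(d + 1)),
    # an interval of length log10(1 + 1/d).
    return math.log10(1 + 1 / d)

probabilities = {d: first_digit_probability(d) for d in range(1, 10)}

for d, p in probabilities.items():
    print(f"{d}: {p:.3f}")
```

The values fall from about 0.301 for the digit 1 to about 0.046 for the digit 9, in line with Newcomb's remark that "the frequency diminishes up to 9."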
"Mantissae" probably seems as archaic to today's readers as a starter crank on the front of an automobile, but until 1960 or so every high-school science student was taught the lore of logarithms, and in particular how to use "common" (base-10) logarithmic tables in calculation. Their use involved the separation of a logarithm into two parts: its integer part (the characteristic) and its fractional part (the mantissa). Here is an example:
Suppose, before the days of hand-held calculators, you needed a rapid way to multiply four-digit numbers, and to divide that product by another four-digit number, with an answer accurate to three digits. Say
86.73 X 1.265 X 7607 / .3018.
Procedure: You think of each of the numbers as a power of 10 times a number between 1 and 10:
86.73 = 10^1 X 8.673
1.265 = 10^0 X 1.265
7607 = 10^3 X 7.607
.3018 = 10^{-1} X 3.018.
When you take logarithms, since log(ab) = log a + log b,
log(86.73) = 1 + log(8.673)
log(1.265) = 0 + log(1.265)
log(7607) = 3 + log(7.607)
log(.3018) = -1 + log(3.018).
The second term in each of the logs is a number between 0 and 1: this will be the mantissa; the leading term is the characteristic. To obtain the log of the answer that we want, log(86.73) + log(1.265) + log(7607) - log(.3018), we make two calculations. First we add or subtract the characteristics; this is an integer calculation. Users of the "slide-rule" (an analogue device conveniently replacing the consultation of logarithmic tables, common through the first half of the twentieth century) would do this part in their heads. In this case the total is 1 + 0 + 3 - (-1) = 5. Then you consult a five-place logarithmic table for the mantissae:
log(8.673) = .93817
log(1.265) = .10209
log(7.607) = .88121
log(3.018) = .47972.
The mantissae total (with signs) to 1.44175. You chop off the "1" and add it to the characteristic. The log table gives log(2.765) = .44170 and log(2.766) = .44185. Since you only expect 3 places of accuracy, you can take 2.765 as the mantissa contribution to the product, which you calculate as 10^{5+1} X 2.765 = 2,765,000. Feeding the numbers into a digital calculator gives an answer to nine places: 2,765,375.13; but if the factors have an indeterminacy in the fifth place, the fourth digit in the product is not reliable: the extra precision is illusory.
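The same bookkeeping can be sketched in Python, with math.log10 standing in for the printed table. The helper name split_log and the variable names are my own, introduced only for illustration; the numbers are those of the example above.

```python
import math

def split_log(x: float) -> tuple[int, float]:
    """Return (characteristic, mantissa) of the base-10 logarithm of x > 0."""
    log_x = math.log10(x)
    characteristic = math.floor(log_x)
    mantissa = log_x - characteristic  # lies in [0, 1)
    return characteristic, mantissa

numerators = [86.73, 1.265, 7607]
denominators = [0.3018]

parts_num = [split_log(x) for x in numerators]
parts_den = [split_log(x) for x in denominators]

# Integer step: add and subtract the characteristics.
char_total = sum(c for c, _ in parts_num) - sum(c for c, _ in parts_den)
# Table step: add and subtract the mantissae.
mant_total = sum(m for _, m in parts_num) - sum(m for _, m in parts_den)

# Move the integer part of the mantissa total over to the characteristic.
carry = math.floor(mant_total)
characteristic = char_total + carry
mantissa = mant_total - carry

result = 10 ** characteristic * 10 ** mantissa
print(char_total, round(mant_total, 5), characteristic, round(result))
```

Unlike the table user, the program carries full floating-point precision, so it reproduces the calculator's answer rather than the three-place one.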
Newcomb first argues that all his "natural numbers" are ratios. This makes sense because most natural numbers are given in units, and the number exhibited is the ratio of some measurement to the same measurement taken on some more or less arbitrary token, e.g. the standard kilogram, the solar year. Then he argues that the set of natural numbers must be closed under further formation of ratios, i.e. under multiplication and division. This implies that the set of logarithms of natural numbers is closed under addition and subtraction; and in particular that the set of mantissae of logarithms of natural numbers is closed under addition and subtraction modulo 1, since as in the example above, when a sum of mantissae is greater than 1 the integer part is moved over to the characteristic; and similarly when it is less than 0. In Newcomb's words: "Since these exponents [the mantissae] are formed by casting off all the integers from a series of numbers, we may suppose them arranged around a circle ..." where we can add and subtract them like angles, except modulo 1 instead of modulo 2π.
Next Newcomb asks the question (translated into our notation): Given a number of points on the circle distributed "according to any arbitrary law," choose n of them at random, say s_1, s_2, ..., s_n, and form the sum s_1 ± s_2 ± ... ± s_n (modulo 1). What is the probability that this sum will be contained in a given interval of length ds? And he answers: "It is evident that, whatever may be the original law of arrangement," the set of such sums "will approach to an equal distribution around the circle as n is increased," or, in other words, "the required probability will be equal to ds." That is, the law of probability of the occurrence of numbers is such that all mantissae of their logarithms are equally probable.
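Newcomb's question can be tried numerically. The Python sketch below is my own experiment, not anything in his note: it starts from a deliberately lopsided set of points (all of them in the arc from 0 to 0.2, an arbitrary choice), forms signed sums of n randomly chosen points modulo 1, and reports what fraction of the sums lands in each tenth of the circle. The sample sizes, the seed and the function names are assumptions made for illustration.

```python
import random

def signed_sum_mod_1(points, n, rng):
    """Pick n points at random and return s1 ± s2 ± ... ± sn modulo 1."""
    total = rng.choice(points)
    for _ in range(n - 1):
        total += rng.choice((-1, 1)) * rng.choice(points)
    return total % 1.0

def bin_fractions(values, bins=10):
    """Fraction of values falling in each of `bins` equal arcs of the circle."""
    counts = [0] * bins
    for v in values:
        # min() guards against a value that rounds up to exactly 1.0.
        counts[min(int(v * bins), bins - 1)] += 1
    return [c / len(values) for c in counts]

rng = random.Random(0)  # fixed seed so the run is repeatable
# A lopsided starting "law": every point lies in the arc [0, 0.2).
population = [rng.uniform(0.0, 0.2) for _ in range(1000)]

results = {}
for n in (1, 2, 5, 20, 50):
    sums = [signed_sum_mod_1(population, n, rng) for _ in range(10000)]
    results[n] = bin_fractions(sums)
    print(n, [round(f, 2) for f in results[n]])
```

For n = 1 all the sums sit in the first two tenths of the circle; as n grows the ten fractions should drift toward 0.1 each. Of course a simulation only illustrates Newcomb's assertion of equal distribution.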
This is not evident, but it is plausible. The following figure shows a small simulation of the phenomenon. Here just two "mantissae" s and t, corresponding say to natural numbers m and n, are chosen; the mantissae corresponding to the products m^i n^j are plotted around the circle of numbers modulo 1, for i, j running from 0 to 8. Comparison with the logarithms of numbers starting with 1, 2, etc. suggests an explanation for the distribution of these numbers among natural numbers.
a. An illustration of the equal distribution phenomenon Newcomb refers to. Here two numbers s and t are chosen on the circle of circumference 1 (I took numbers corresponding to angles 41° and 95°); the green angles correspond to all the numbers of the form is + jt (modulo 1), for i and j integers between 0 and 8. b. The mantissae corresponding to the integers 1, 2, ..., 9. This is the same display that occurs on a circular slide-rule (see below).
Part of a circular slide-rule designed by John W. Mauchly. Mauchly was one of the designers of the ENIAC, the first large-scale general-purpose electronic computer. There was presumably another, smaller, paper disc with similar gradations that could rotate on top of this one, and probably a rotating pointer for keeping track of locations. (Image courtesy of University of Pennsylvania Libraries.)
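The small simulation in the figure can be reproduced along the following lines. This is a sketch under my reading of the caption: the angles 41° and 95° are taken as the fractions 41/360 and 95/360 of the circle, and a point is assigned to the digit d whose arc, from log(d) to log(d+1), contains it. The function and variable names are hypothetical.

```python
import math

s = 41 / 360   # first point, as a fraction of a full turn
t = 95 / 360   # second point, as a fraction of a full turn

# All combinations i*s + j*t (modulo 1) for i, j = 0..8, as in panel (a).
points = [(i * s + j * t) % 1.0 for i in range(9) for j in range(9)]

def leading_digit(mantissa: float) -> int:
    """Digit d whose arc [log10(d), log10(d + 1)) contains the mantissa."""
    return min(9, int(10 ** mantissa))

counts = {d: 0 for d in range(1, 10)}
for p in points:
    counts[leading_digit(p)] += 1

for d in range(1, 10):
    arc = math.log10(1 + 1 / d)  # length of the arc belonging to digit d
    share = counts[d] / len(points)
    print(f"{d}: {share:.3f} of the points, arc length {arc:.3f}")
```

With only 81 points the agreement between the two columns can only be rough; the comparison is meant to show the tendency, with the long arc for the digit 1 collecting the most points.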
It took more than a hundred years for a satisfactory explanation of Newcomb's observation to appear. The main stumbling block was the lack of a precise mathematical concept corresponding to Newcomb's "natural numbers." Theodore Hill realized that base-invariance was the key property: the uniform distribution of mantissae of natural numbers in any base (not only in base 10); this had already been remarked by Newcomb. As Hill states it, "there is a unique countably-additive base-invariant probability measure on the positive reals."
Frank Benford, The law of anomalous numbers, Proceedings of the American Philosophical Society 78 (1938) 551-572
Theodore P. Hill, Base-invariance implies Benford's law, Proceedings of the American Mathematical Society 123 (1995) 887-895
Simon Newcomb, Note on the Frequency of Use of the Different Digits in Natural Numbers, American Journal of Mathematics 4 (1881) 39-40