Words and More WordsHere I will take a brief look at some of the ways mathematics has looked at words in the hope of exploiting the resulting ideas....
"Words! Words! I'm so sick of words!," Eliza Doolittle declares in her outburst dealing with the indignities Henry Higgins has put forth in her direction in the classic musical comedy My Fair Lady. On one level one empathizes with Eliza but words are special. How do they fit into the mathematical world rather than the worlds of English and language?
Tiling inspired by the Thue-Morse sequence, Courtesy of Mark Dow, U. of Oregon
Fibonacci numbers and words
When you see numbers like 14523, 32145, and 11223 there are various patterns you might notice and "facts" associated with the numbers. The first two have all distinct digits. For 11223 the digits used include no digits beyond those used in the first two. All of these numbers have 5 digits and the second one can't be prime. (Primes are those positive integers bigger than 1 whose only factors are 1 and themselves.) Implicit in this discussion is that we are looking at numbers base 10, with the digits 0, 1, ..., 9. However, there is another point of view about these "numbers." They can be viewed as strings or "words." When we in America think about words we usually think about English words like scared, sacred, and scarred. These words have different meanings in English but they all use the same letters,
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, .... (*)
and this sequence of letters:
a, ab, aba, abaab, abaababa, abaababaabaab, (**)
Note that all the strings of the sequence (**) begin with the letter a. For the first sequence (*) we got from one term to the next in the sequence by thinking of a numerical pattern. The next number in the sequence is generated by adding the two previous terms of the sequence (list).
We started sequence (*) with n = 0 so the formula above is valid for n equal to 2 or greater, and F0 = 0 and F1 = 1. (Sometimes it is convenient to start this sequence 1, 1, 2, .... rather than as has been done here.) In addition to the recursion equation (computes later values in a sequence in terms of prior ones) given above we need "initial" values or "conditions" for this recursion equation to "act" on.
a, ab, aba, abaab, abaababa, abaababaabaab, .....
The "fancy" term for this process is concatenation: we concatenated two previous strings to get the next string. We needed initial conditions on which the recursion equation can act. It is traditional to denote concatenation using the symbol for "multiplication," as we have done above. Note that the lengths of these words are 1, 2, 3, 5, 8, 13, .... which are the numbers of the Fibonacci number sequence (starting with the third term of the way that sequence is generally described). f4 has 5 a's and 3 b's, while f5 has 8 a's and 5 b's. Can you see what the pattern is here?
f = abaababaabaab...
a, ab, aab, abaab, aababaab, abaabaababaab, ....
You can see that this sequence is different from (**) since none of the strings in (**) begin with aa, while above the strings eventually alternately begin with aa and with ab. Can you see how the sequence above is related to (**)? Below the two sequences are lined up to help you see what is going on:
a, ab, aba, abaab, abaababa, abaababaabaab, .....
a, ab, aab, abaab, aababaab, abaabaababaab, ....
0, 01, 010, 01001, 01001010, 0100101001001,
Leonardo Fibonacci (circa 1170-1250)
Fibonacci looked at the sequence (*) in conjunction with a question about counting rabbit pairs in a special way. The Fibonacci sequence (Fibonacci numbers) is among the most studied in all of mathematics. Not only does it display wonderfully pleasing patterns but it has many applications in unexpected places, including in computer science to searching through and sorting integers.
Combinatorics on words
What today is known as the combinatorics of words is tied to an important sequence in the spirit of the infinite Fibonacci word we have looked at already, which is now known as the Thue-Morse sequence or, sometimes, as the Prouhet-Thue-Morse sequence. Before saying more about this sequence itself, let me recount some of the history here.
(Axel Thue (1863-1922))
He was the only doctoral student of Elling Holst who in turn was a student of the famous Norwegian mathematician M. Sophus Lie. Axel Thue had only one student, the logician Thoralf Skolem, who achieved much more fame than his thesis advisor. Thue was primarily a number theorist. His most famous work was on the approximation of real numbers using rational numbers. but he became interested in the behavior of strings, seemingly because of his interest in the theory of groups (abstract algebra). In all, Thue wrote several papers about what has come to be called the combinatorics on words. In reviewing Thue's collected works, published in 1977, Wolfgang Schmidt writes: "Thue was far ahead of his time." This is reflected in the fact that much of what he did had no immediate effect on research about combinatorics of words. Partly because his work was published in "offbeat" journals, it was not picked up immediately; it was only noticed much later how important his ideas were.
(Marston Morse (1892-1977))
Morse independently of Thue discovered the Thue-Morse sequence in the context of work in what has come to be called symbolic dynamics, which is part of the broader subject of dynamical systems that arises in topology and analysis. Loosely speakimg, the idea is to study what happens when a function is applied to an input value, and then applied again to the output, continuing in this manner and generating what are called "orbits."
Before dealing with the truly extraordinary Thue-Morse sequence, let me say a bit more about how better known issues are related to some topics that come up in combinatorics of words.
Now, at last, the Thue-Morse sequence. We have already noted above that any infinite word made up of zeros and ones must contain a square. A natural question is to see what kinds of patterns can be avoided in an infinite word of zeros and ones. Before you see the Thue-Morse sequence, consider the following result.
Using the entries in column 3 of the table above we can write down the first 17 entries in the T-M sequence. Here they are:
0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1
If we omit the commas and extend the table, here is the (start) of the infinite word version of the T-M sequence:
T-M = 01101001100101101001011001101001....
To cement what is going on here, suppose you wanted to know what the 50th entry in T-M is. Since 50 when written in binary is 110010, which has 3 one's, we know that the TM50 = 1.
One infinite word that rivals this one in interest is the infinite version of the Fibonacci words that we discussed earlier but using the alphabet of a and b rather than the 0 and 1 symbol alphabet. Here for comparison is an initial segment of the infinite Fibonacci word:
f = 010010100100101001010......
Here on top of one another are the Thue-Morse sequence (on top) and the Fibonacci word:
Are there other ways of generating T-M?
Here is a recursive approach to defining the entries of the sequence!
TM0 = 0
(a) TM2n = TMn and (b) TM2n+1 is not equal to TMn
What is going on here? When a number is doubled, what happens to its binary representation? The answer is that a zero gets added at the end which means that the number of 1's in the representation stays the same.
So TM0 = 0 and using (b) with n = 0 we get that TM1 = 1. Now n = 1, we use (a) to conclude that TM2 is the same as TM1 or 1, while using (b) with n = 1 we see that TM3 should be different from TM1 and, thus, be 0. Continuing on in this manner we get the same values as we did previously.
If you think through the meaning of the way that the T-M is generated above, there is another nifty way of generating the T-M:
From the sequence listed from the 0th term to the 2nth term, we will see how at each stage we can double the number entries we have in the 'infinite word." Write down those terms from the sequence and now write down this string next to the original but with the roles of 0 and 1 interchanged. Here is what you get when you carry this out:
which you can check is identical with what we had above but the work is much easier!
The more historical approach to generating the T-M sequence, however, involves what have come to be called a "morphism" approach. The basic idea is to start with some simple word and replace (using a function) each of the letters used in this word with some string to get a new string. Iterating this function over and over will generate longer and longer words, and the "limit" of these words is the infinite word.
In words, given a starting string S, replace 0 in the string by 01 and 1 in the string by 10. Use 0 as your starting string.
So here is what happens:
As you can see, this is our "old friend" the T-M sequence.
For those familiar with function notation here is how one expresses this:
tm(0) = 01; tm(1) = 10. T-M can be written tm∞(0). This last means tm(tm(tm...(tm(0)....))). We compute the result of applying the function to get an answer starting with the string 0, and repeat applying the "substitution map" over and over again.
So, what is it that is interesting about the Thue-Morse sequence? We have already seen that with only an alphabet of two symbols we can't construct long words that don't have squares (e.g. strings like xx). This raises a variety of questions both of theoretical and applied interest:
a. What are the fewest symbols that an alphabet can have and yet admit arbitrarily large square-free words?
We will see that Axel Thue was able to construct a sequence of this kind, employing ideas related to the T-M sequence, using only an alphabet of 3 symbols--a remarkable result.
b. To what extent does the T-M sequence avoid having overlaps? An overlap in an infinite (one-way) string is a substring which looks like aZaZa where a is any single letter in the alphabet and Z is any string, possibly of length 0. When Z has length 0, then the pattern aZaZa becomes a single letter cubed (aaa). If there is no pattern of the kind aZaZa, then the (infinite) word is said to be overlap-free.
Why are overlaps of interest? Years after Thue's work biologists are trying to understand the meaning of strings of DNA in different species. Each species has a collection of genes which contain the genetic information for the species. For a long time the DNA outside the stretches which "code" information was thought to be "junk." This referred to DNA that was there but did not have any part in inheritance issues. We now know the situation is much more complex, and that DNA outside of sections that are genes may have "regulatory" or other information that governs the way(s) genes are "expressed." Very often geneticists are interested in "tandem repeats." Locating such sections in DNA are valuable in determining a particular person's parentage and other aspects of genome determination.
Theorem: The Thue-Morse infinite word (which uses only two symbols) is overlap-free.
Here is a way to construct a word that is square free that uses only three symbols and which is based, you guessed it, on the Thue-Morse sequence.
We will use only the symbols 0, 1, 2 in the word we construct, and we will denote the sequence SFn (for square free) where n = 1, 2, 3, .....
SFn will denote the number of 1's between the nth and (n+1)st zeros in the T-M sequence. Here is a stretch of the T-M sequence:
SF1 is 2, since first 1 is in position 0, and the second 0 is in position 3, so there are two one's between these positions.
SF2 is 1 since the second 0 is in position 3 and the third 0 is in position 5 so there is one one between them.
SF3 = 0
SF4 = 2
SF5 = 0
SF6 = 1
SF7 = 2
which gives the initial section of the SF infinite word:
It is easy to see that the distance between two ones in the T-M sequence can only be 0, 1, or 2. Otherwise we would have as many as three ones in a row, the word 111, which would mean that the T-M sequence contained a cube, which (though we did not include the proof) cannot occur. Were this sequence to have a square, it can be shown that the T-M sequence would have to have an overlap, which cannot be the case. One can also find morphism approaches to constructing square free words with an alphabet of only size three.
The Thue-Morse, though seemingly having a primarily combinatorial and algebraic quality, has also inspired geometrical work and artistic work. Tilings, freezes, knots, and fractals all have been investigated inspired by the Thue-Morse sequence. A sample appears below (and at the beginning of the column), due to Mark Dow who works at the brain development laboratory at the University of Oregon.
(Courtesy of Mark Dow, University of Oregon)
In recent years there has been a tremendous explosion of results in the combinatorics of words, connections being found to matrix theory, automata theory, graph theory, etc. I think Eliza Doolittle would have approved no matter what Henry Higgins might have said!
Allouche, J-P. and J. Shallit. The ubiquitous prouhet-thue-morse sequence, in Sequences and their applications, pp. 1-16. Springer, London, 1999.
The AMS encourages your comments, and hopes you will join the discussions. We review comments before they're posted, and those that are offensive, abusive, off-topic or promoting a commercial product, person or website will not be posted. Expressing disagreement is fine, but mutual respect is required.
Welcome to the
These web essays are designed for those who have already discovered the joys of mathematics as well as for those who may be uncomfortable with mathematics.
Search Feature Column
Feature Column at a glance