The intent of this column is to explain how [Claude] Shannon proposed that we measure information and how this measure helps us understand the rate at which it can be transmitted. ...
Tech news these days is filled with stories about the upcoming 5G revolution in wireless networking. And though the revolution seems a little slow getting started, we're promised it will bring significantly faster transmission rates and shorter delays.
While there are several reasons to expect improvements, one significant factor is the use of polar codes to encode information for transmission. Introduced by Erdal Arikan in 2009, polar codes are optimal, in a precise sense describd by information theory, and well suited for this application.
Originally, I intended for this column to be about polar codes, but, after reading around for a while, I thought it could be useful to back up and describe how information is quantified. Much of this theory originates with Claude Shannon's seminal 1948 paper, which is brilliant but can be challenging to read. On the other hand, many recent expositions aim for more rigor and lose some of the intuition that newcomers may seek to develop a clear sense of what the subject is ultimately about.
So the intent of this column is to explain how Shannon proposed that we measure information and how this measure helps us understand the rate at which it can be transmitted.
The first thing to know is that information, as quantified by Shannon, is a measure of the amount of choice we have in expression.
In Grand Rapids, Michigan, where I live, a local television station frequently promotes its aptly named Weatherball, a large ball, prominently displayed on a tower, whose color gives a simple weather forecast:
Weatherball red, warmer ahead.
Weatherball blue, cooler in view.
Weatherball green, no change foreseen.
To Grand Rapids residents, the Weatherball is a much-loved icon inspiring an amount of affection that, in spite of my 20 years of residence, has never been adequately explained. To Shannon, however, the Weatherball is an information source capable of transmitting messages composed of three symbols: red, blue, and green. His measure of information isn't defined by the symbol transmitted at any one time, but rather by the fact that there are three choices for symbols. This would be a low information source.
In contrast, a forecast from the National Weather Service gives tomorrow's predicted high temperature. As an information source, there is greater choice in the message transmitted, which makes this a higher information source. For instance, if we see that the high temperature today will be 25 degrees, we also know that it won't be 37 or 62.
So our measure of information will relate to the number of possible symbols we can produce. There is, however, also a statistical component to this measure.
I haven't analyzed the Weatherball to know whether one color is displayed more frequently than others, but it seems possible that the colors (red, blue, and green) are displayed equally often. That is, the probability of seeing one of the colors is 1/3.
Suppose instead that we imagine another Weatherball in Pleasantville, a city in a more moderate climate where the temperature from one day to the next doesn't change a lot. In that case, green ("no change foreseen") would be more likely than the other colors, red or blue. Let's imagine that the probability of seeing each of the colors is:
Green: | 1/2 |
Red: | 1/4 |
Blue: | 1/4 |
While a typical sequence in Grand Rapids might look like this:
a typical sequence in Pleasantville may be:
Shannon tells us that the Weatherball in Grand Rapids is a higher information source than Pleasantville's in spite of the fact that they use the same three symbols. Here's why. Suppose we want to transmit the Weatherball's state every day over a wireless network. We first need to encode it into a suitable format, and a binary code seems like a good candidate. We can leverage the fact that we are more likely to see green in Pleasantville to create an efficient encoding. Consider, for instance, the encoding:
Green: | 0 |
Red: | 10 |
Blue: | 11 |
That is, we use a shorter code for the symbol we see more frequently. The average number of binary digits sent is then $$ \frac12\cdot 1 + \frac 14\cdot 2 + \frac14\cdot 2 = \frac32. $$ That is, we transmit, on averate, 3/2 binary digits per symbol.
In Grand Rapids, the average number of digits sent is $$ \frac13\cdot 1 + \frac 13\cdot 2 + \frac13\cdot 2 = \frac53, $$ which is larger than the 3/2 binary bits per symbol in Pleasantville.
The fact that we can squeeze the Pleasantville symbols into a smaller number of binary digits without losing information indicates that the Pleasantville Weatherball is a lower information source. (Of course, we've only looked at one encoding so there may be better choices for Grand Rapids.)
Remember that we can think about information as a measure of the freedom we have to choose a message. Since the Weatherball in Pleasantville causes us to choose green more frequently than red or blue, our choice is somewhat restricted, which lowers the amount of information.
The real power of this interpretation, as we will see later, is that we can use it to understand the maximal rate at which information can be transmitted.
Embedded in this example is a pipeline, as illustrated on the left.
We begin with an information source, the Weatherball in our example, that produces a message as a sequence of symbols. These symbols are encoded, in some way, into a new set of symbols that is suitable for transmission through a communication channel.
The channel could be any means of conveying encoded messages, such as a wireless network, a telegraph line, or semaphore flags. As illustrated here, the message that enters the channel is the same one that exits. It's possible that the message is somehow corrupted in the channel, and this can be built into the theory, though we won't do so here.
If we are given an information source and a channel, our central task will be to find an encoding that maximizes the rate at which information is transmitted through the channel.
Our first task will be to state explicitly what we mean by an information source and then quantify the information generated by it.
As in the examples above, an information source will produce a message composed of a sequence of symbols. The Weatherball might produce $$ GRBRGRBRGRBRGRBGRBGGGRRBRGR\ldots $$ if we record the color every day for a period of time. However, the source also includes statistical data about the frequency with which the symbols appear.
Shannon describes the English language as an information source. In its simplest form, this source is capable of generating strings of letters and other punctuation symbols. Focusing just on the 26 letters, we know that letters are not equally likely; for instance, "E" appears much more frequently than "Z."
But there is additional statistical information besides the frequencies with which letters appear. For instance, the letter "Q" is almost always followed by "U" (though Scrabble players know a few other options). That is, the probability of encountering a symbol may depend on what comes before it.
For this reason, we consider an information source to be a Markov or stochastic process. At any time, the source is in a state $S$, and the probabilities for generating the next symbol depend on $S$.
We can represent this situation graphically, as in the following example. Suppose our source generates messages with three symbols $A$, $B$, and $C$ and that the state we are in depends only on the last symbol generated. We label the states with the last symbol and represent the probabilities for generating the next symbol, and hence moving to the next state, as shown below:
If we consider an information source in which the probabilities for the symbols are independent of what has come before, then our Markov process has a single state that we never leave.
The processes we consider will always be ergodic, which has the following intuitive interpretation: Two very long messages produced by the source should have similar statistical properties. For instance, consider two typical novels written in English. Since English is an ergodic information source, we expect that the frequencies with which letters, digrams, trigrams, and so forth appear should be nearly the same for the two novels.
Following Shannon, we would like to create a measure of information associated to a Markov process. For the time being, suppose that the $n$ symbols appear with probabilities $p_1$, $p_2$, ..., and $p_n$. Our measure $H(p_1,p_2,\ldots,p_n)$ should satisfy the following criteria:
$H$ is continuous in the probabilities $p_i$.
Adding more symbols creates more choice and, hence, more information. More specifically, considering sources where each of $n$ symbols is equally likely, that is, $p_i = 1/n$, we expect $H$ to increase with $n$.
If the process of choosing a symbol can be broken into two successive choices, then $H$ adds appropriately. For instance, consider the following description of the Pleasantville Weatherball: we first choose, with equal probability, whether the weather will change or not. If the weather changes, we choose, again with equal probability, whether it warms up or cools down. The following diagram illustrates:
In this case, we want $$ H\left(\frac14,\frac14,\frac12\right) = H\left(\frac12,\frac12\right) + \frac12H\left(\frac12,\frac12\right). $$
Shannon shows that these three properties determine $H$ up to a multiplicative constant, which leads to the definition: $$ H(p_1,p_2,\ldots,p_n) = -\sum_i p_i\log_2(p_i). $$
A similar quantity appears in statistical mechanics as entropy, a measure of a system's disorder, so we call $H$ the entropy of the information source.
Here are some examples.
Suppose the information source generates two symbols, 0 and 1, with equal probability. Then, $$ H\left(\frac12,\frac12\right) = -\frac12\log_2\left(\frac12\right) -\frac12\log_2\left(\frac12\right) = 1. $$
More generally, if the source generates 0 with probability $p$ and 1 with probability $1-p$, then $$ H(p) = -p\log_2(p) - (1-p)\log_2(1-p), $$ whose graph is
Notice that $H(0) = H(1) = 0$. For example, if $p=0$, the source is forced to generate only 1's so that its only possible message is 11111111111.... Here, the source has no information since we know what the message will be before even receiving it.
We see that $H$ is a maximum when $p=1/2$, when both 0 and 1 appear with equal probability, since this means the source offers the greatest freedom of choice when generating messages.
As this source generates one binary digit per symbol, we say that the units on $H$ are bits per symbol. As the graph above shows, however, we shouldn't equate bits with binary digits since $H$ is not necessarily an integer.
Another example Shannon describes is a teletype, an electromechanical device for transmitting messages. Pressing a key on the keyboard of a teletype machine punches some combination of five possible holes on a piece of paper tape. When read back, each of five electrical contacts open or close depending on pattern of holes punched in the tape. In this way, each of $2^5 = 32$ keys are encoded in a system of five binary digits. If each key is pressed with equal probability, the entropy of this source is $\log_2(32) = 5$ bits per symbol. In this example, a bit faithfully corresponds to a binary digit.
The Grand Rapids Weatherball has three symbols, which we assumed appear with equal probability. This leads to $H = \log_2(3) \approx 1.58.$ By contrast, the Weatherball in Pleasantville has $H = 3/2 = 1.5$, confirming our earlier suspicion that the Weatherball in Pleasantville is a lower information source. In fact, we encoded the Pleasantville source into a stream of binary digits using $3/2$ bits per symbol, which implies, as we'll see later, that this is the best possible encoding.
The entropy of the English language may be understood through a series of approximations. For convenience, let's focus only on the 26 letters. As a zero-th order approximation, Shannon begins with the source having 26 symbols, each appearing with equal probability. We then have $H=\log_2(26) \approx 4.70$.
As a first-order approximation, we imagine that the letters are generated with their naturally occurring frequencies so that, for instance, "E" appears with probability 0.12 and "Z" with 0.02. We would expect that the entropy decreases from the zero-th order approximation since we are now imposing some restrictions on the source. Indeed, we find that $H=4.18$.
As mentioned earlier, the probability that a given letter is generated depends on the letter before it, which leads to a second order approximation. For instance, when "Q" appears, there is a probability of 0.92 that "U" comes next but only 0.02 that "A" follows. Our source is now a Markov process with states $S$ defined by the last letter generated. For each state $S$, we find an entropy $H(S)$ using the probabilities of generating the next letter. The entropy of the Markov process is then the weighted average: $$ H = \sum_i P_i~H(S_i) $$ where the sum is over all states $S_i$ and $P_i$ is the probability of being in state $S_i$ at any time. We find the entropy to be $H = 3.66$.
We may, of course, consider a third-order approximation where a state is determined by the previous two letters generated; higher-order approximations appear in the same way. With increasing order, messages generated start to look more and more like intelligible English.
To summarize, we have
Order | Entropy |
0 | 4.70 |
1 | 4.18 |
2 | 3.66 |
Given an information source, Shannon defines the relative entropy to be the ratio of the entropy to the maximum entropy using the same set of symbols. For instance, the Pleasantville Weatherball has an entropy of 3/2 bits per symbol. If we give the same three symbols equal probability, as in the Grand Rapids Weatherball, we have a maximal entropy of $\log_2(3)$. The relative entropy of the Pleasantville Weatherball is then $(3/2)/\log_2(3) = 0.946$. The redundancy of an information source is one minus the relative entropy so, in this case, we have a redundancy of about 5.4%, which presents an opportunity to compress the information, which we did above by encoding it into a sequence of binary digits.
Using the approximations described above, Shannon estimated the entropy of English to be about 2.3 bits per letter, giving a redundancy of about 50%. This means that it's possible to delete half the letters in an English message and reconstruct the meaning of the whole, a fact that's not surprising to anyone who texts or tweets frequently. This is also good news for those of you who only read half of Moby Dick, provided you read an appropriate half.
So far, we have described entropy as a measure of the amount of freedom we have in creating a message. Let's examine a couple of implications.
First, suppose that an information source with entropy $H$ has $n$ symbols appearing with probabilities $p_1$, $p_2$, ..., $p_n$ and that this source generates a message as a long sequence having $N$ symbols. Each symbol appears about $p_iN$ times in the message so that the probability of obtaining that message is approximately $$ p \approx p_1^{p_1N} p_2^{p_2N} \ldots p_n^{p_nN}. $$ Notice that $$ \log_2(p) \approx p_1N\log_2(p_1) + p_2N\log_2(p_2) + \ldots p_n\log_2(p_n) = -NH. $$ so that $$ p \approx 2^{-NH}. $$
This leads to two important conclusions. First, the probability of obtaining a long message is approximately constant. This is perhaps not too surprising if we remember our assumption that the information source is ergodic. We also conclude that there are about $2^{NH}$ such sequences. Notice that larger values of $H$ imply a larger number of messages, reinforcing our interpretation of $H$ as a measure of our freedom to choose a message.
This observation can be made more precise by stating that the messages of $N$ symbols, as $N$ grows large, can be divided into two sets: one is a set of about $2^{NH}$ messages each of which occurs with roughly equal probability (we will refer to these as high probability messages) and the other is a set whose total probability becomes vanishingly small as $N$ goes to infinity.
Relative entropy provides additional meaning here. Suppose we have an information source $I$ whose relative entropy is $r$ and redundancy is $\rho = 1-r$. If $\overline{H}$ is the maximal entropy of an information source having the same set of $n$ symbols as $I$, then $\overline{H} = \log_2(n)$. This maximal source produces $n^N=2^{N\overline{H}}$ messages of length $N$, each of which is equally probable. Contained within this set is a smaller set of $2^{NH}$ high probability messages for the information source $I$. Therefore, the fraction of high probability messages for $I$ is $$ 2^{NH}/n^N = 2^{N (H-\overline{H})} = 2^{-N\overline{H}(1-r)} = 1/(n^N)^{1-r} = 1/(n^N)^\rho $$ where $\rho$ is the redundancy of $I$.
Notice that, if $\rho=0$, there is no redundancy so every possible message is a high probability message. In contrast, we said that the redundancy of English is about $\rho=1/2$. If we take the 26 letters of the alphabet and randomly form sequences of 100 letters, then about $$ 1/(26^{100})^{1/2} = 1/26^{50} \approx 1.8\times 10^{-71} $$ of those will be meaningful English sequences. In this way, we could quantify how long it would take a randomly typing monkey to write a meaningful sentence of 100 characters.
We've been looking at entropy as a measure of information and have seen how it intuitively feels like a useful measure. With a little more work, however, we will see that entropy is exactly the right measure for determining the rate at which information can be transmitted.
To begin, recall our pipeline. We have an information source generating messages as sequences of symbols. We would like to transmit messages through a channel, which could be something like a wireless network or telegraph line. We can also encode the messages generated by the source in an effort to (a) more efficiently transmit them and (b) improve the reliability of the transmission in the event there is noise in the channel that might corrupt the message.
While Shannon presents a general framework for describing the rate at which the channel can transmit symbols, we will consider a simplified version that retains the essential ingredients. To that end, suppose our channel transmits $C$ binary digits in one time unit. In $T$ time units, the channel is capable of transmitting $CT$ digits and hence $2^{CT}$ possible messages.
Suppose that our source generates symbols at the rate of $R$ symbols per unit time. In $T$ time units, we have $RT$ symbols and therefore $2^{RTH}$ high probability messages. To reliably transmit these messages in $T$ units of time, we need to be capable of sending more than the number of high probability messages: $$ 2^{CT} \geq 2^{RTH} $$ and hence $CT \geq RTH$. This shows that the transmission rate is bounded by $$ R \leq \frac CH $$ symbols per unit time. In this way, the entropy provides an upper bound for the rate at which symbols can be transmitted. Higher entropy leads to more information, which leads to a lower rate.
So the transmission rate is bounded by $C/H$, but we can say more. In fact, it's always possible to find an encoding that allows a transmission rate as close to $C/H$ as we wish, and Shannon shows us how to construct this encoding using what we now call a Shannon code.
First, note that we are going to encode messages of $N$ symbols rather than individual symbols. Let's begin by collecting all possible messages of $N$ symbols and recording the probability with which they are generated. Next, we'll order the messages so that their probabilities are non-increasing. That is, if $q_i$ is the probability of the $i^{th}$ message, then $$ q_1 \geq q_2 \geq q_3 \geq \ldots. $$ By $C_i$, we will mean the cumulative probability of this sequence so that $$ C_i = \sum_{j\lt i} q_j. $$ Here's the punchline: the $i^{th}$ message will be encoded as the binary expansion of $C_i$ using $d_i=\lceil\log_2(1/q_i)\rceil$ binary digits.
That probably seems like a lot to take in so let's return to Pleasantville and look at an example. Remember we have three symbols $G$, $R$, and $B$ and probabilities $p_G = 1/2$, $p_R = 1/4$, and $p_B = 1/4$. Also, the entropy is $H=3/2$.
We will encode messages of length $N=1$, which are just single symbols. Ordering these three messages, we have the following table, which includes the binary expansion of $C_i$. $$ \begin{array}{c|c|c|c|c} {\bf Message} & q_i & C_i & d_i = \lceil\log_2(1/q_i)\rceil & {\bf Encoding} \\ \hline G & 1/2 & 0=0.0 & 1 & 0 \\ R & 1/4 & 1/2 = 0.10 & 2 & 10 \\ B & 1/4 & 3/4 = 0.11 & 2 & 11 \\ \end{array} $$
Therefore, we have found the same encoding we used earlier:
Green: | 0 |
Red: | 10 |
Blue: | 11 |
Let's consider another information source with symbols $A$, $B$, $C$ generated with probabilities $p_A = 2/3$, $p_B = 1/6$, and $p_C=1/6$. The entropy of this source is $$ H = -\frac23\log_2(2/3) - \frac16\log_2(1/6) - \frac16\log_2(1/6) \approx 1.25 $$ bits per symbol.
As before, let's consider encoding messages consisting of $N=1$ symbols. We then find $$ \begin{array}{c|c|c|c|c} {\bf Message} & q_i & C_i & d_i = \lceil\log_2(1/q_i)\rceil & {\bf Encoding} \\ \hline A & 2/3 & 0 = 0.0 & 1 & {\rm 0} \\ B & 1/6 & 2/3 = 0.101 & 3 & {\rm 101} \\ C & 1/6 & 5/6 = 0.110 & 3 & {\rm 110} \\ \end{array} $$
The average length of a message is $$ L_1 = \frac23\cdot 1 + \frac 16\cdot 3 + \frac 16\cdot 3 = \frac53 \approx 1.66. $$ Notice that $L_1 = 1.66 > 1.25 = H$, which means that we can use this encoding to transmit at the rate $C/L_1$, which is only 75% of the optimal rate of $C/H$.
Let's see if there is an improvement if we encode messages composed of $N=3$ symbols. There are $3^3 = 27$ such messages so rather than writing the encoding explicitly, we will focus on finding $L_3$, the average number of bits per symbol. The table below groups messages according to their probability and records the number in each group under the column labeled "#". $$ \begin{array}{c|c|c|c|c|c} {\bf Message} & q_i & {\rm \#} & d_i = \lceil\log_2(1/q_i)\rceil \\ \hline AAA & 8/27 & 1 & 2 \\ AAB,... & 4/54 & 6 & 4 \\ ABB,... & 2/108 & 12 & 6 \\ BBB,... & 1/216 & 8 & 8 \\ \end{array} $$ This means that the average length per symbol is $$ L_3 = \frac13\left(\frac{8}{27} \cdot 2 + \frac{4}{54}\cdot 6 \cdot 7 + \frac{2}{108}\cdot12\cdot6 + \frac{1}{216}\cdot8\cdot8\right) = \frac43 \approx 1.33. $$ Using this encoding, we are able to transmit at the rate of $C/L_3$, which is about 94% of the optimal rate of $C/H$. This improvement is due to the fact that more frequently occurring messages are encoded with a shorter string of bits.
As we see, the length 3 encoding results in greater compression, which produces an improvement in the transmission rate. We might expect that encoding messages of even greater length will offer continued improvement. In fact, Shannon proves that $L_N$ is a non-increasing function of $N$ that approaches $H$. In this way, it is possible to create an encoding that gives a transmission rate as close as possible to the optimal rate of $C/H$.
If you work out a few examples, you'll be able to see that each message is uniquely encoded and that an encoded message can be uniquely decoded, requirements that are essential for properly decoding messages.
We have now seen what Shannon calls the Fundamental Theorem for a Noiseless Channel:
Given an information source with entropy $H$ and a channel transmitting at $C$ bits per unit time, it is possible to find an encoding that leads to a transmission rate as close to $C/H$ symbols per unit time as desired. It is not possible to transmit at a rate greater than $C/H$.
Of course, there are some drawbacks. First, we need to enumerate and sort all the messages with $N$ symbols, and, when $N$ is large, this is likely to be difficult. Second, encoding long messages may result in delays in transmission. Imagine, for instance, having a phone conversation in which a sentence cannot be transmitted until it has been completely spoken.
Shannon's code is useful for demonstrating that the optimal rate of transmission is approachable, but it is only one possible encoding. Perhaps another encoding scheme will approach the optimal rate more quickly or even equal the optimal rate. That's where Arikan's polar codes enter our story, but that message is being encoded and will be transmitted over this channel in June!
In this column, we've been looking at transmission over a clear channel; that is, what comes out of the channel is exactly what went in to it. Anyone who's had a conversation by cell phone knows it doesn't work that way, as noise frequently garbles the message. (It's ironic that, among all its capabilities, a cell phone performs worse at being an actual telephone.)
Shannon's paper also considers noisy channels and develops a measure of how much information can be transmitted over such a channel. A key idea is conditional entropy. If the symbols going into the channel form a set $X$ and the symbols coming out form $Y$, we would hope that knowing we received $y$ should enable us to determine a unique $x$ that went in to the channel to produce $y$. Noise, however, means that we can't be completely confident about the input $x$ so instead we consider the probabilities of different inputs $x$ leading to $y$. That is, we consider conditional probabilities $p(x|y)$, which leads to a conditional entropy, a measure of the amount of uncertainty in reconstructing $x$ if we know $y$. A noiseless channel will have zero conditional entropy since there is no uncertainty.
Using conditional entropy, Shannon is able to determine the optimal rate at which information can be transmitted through a noisy channel. This result may initially seem surprising: if the channel is noisy, how can we trust anything that we receive? Redundancy in the source comes to the rescue, however. For instance, readers will be able to make sense of this sentence that emerged from a noisy channel:
Npisy chamnels still ttansmkt ingormqtion.
In this column, we've been looking at discrete information sources so it's worth noting that Shannon also developed a theory that applies to continuous sources.
Entropy, in fact, appears with some ubiquity. We have already mentioned its appearance in statistical mechanics. Ecologists use a similar quantity to measure biodiversity. Once you start thinking about information sources, you may begin to see them all around you. In particular, you may wonder whether mathematics itself is an information source that is encoded in a highly compressed format using mathematical notation. What is its entropy and its redundancy?
Finally, readers are surely eager to know more about the Grand Rapids Weatherball so I will share that the current ball, pictured above, is the second incarnation. The first Weatherball, located on the roof of a downtown building, weighed 64 tons. We've already said this is a low information source, but now we can see that the entropy per pound is astonishingly small.
Claude Shannon and Warren Weaver. The Mathematical Theory of Communication. University of Illinois Press. 1964.
This volume contains Shannon's original paper and also includes, as a preface, a non-technical overview of the subject and summary of Shannon's results by Weaver.
Claude Shannon. Prediction and entropy of printed English. Bell System Technical Journal. 30.1 (1951): 50-64.
This paper describes Shannon's efforts to estimate the entropy of the English language.
English Letter Frequencies, Practical Cryptography.
This site provides the frequencies of letters and digrams that I used to compute the entropy of the approximations to the English language.