
**The Mathematics of Communication**

## 2. Encoding English text

Considering the encoding of numbers in various bases led us to the conclusion that

- If each digit is one of *b* equally likely possibilities, the number of bits per digit will be log_{2}(*b*).
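As a quick numerical check of this rule, here is a minimal Python sketch. The function names `bits_per_digit` and `bits_for_digits` are illustrative choices, not part of the original discussion.

```python
import math

def bits_per_digit(base: int) -> float:
    """Bits carried by one digit when all `base` values are equally likely."""
    return math.log2(base)

def bits_for_digits(base: int, n_digits: int) -> float:
    """n digits in base b distinguish b**n values, i.e. n * log2(b) bits."""
    return n_digits * bits_per_digit(base)

# Bits per digit for a few bases, including base 27 used for English text.
for base in (2, 3, 10, 27):
    print(f"base {base:>2}: {bits_per_digit(base):.2f} bits per digit")
```

The same rule gives the weight of a whole string of digits: *n* digits in base *b* weigh *n* × log_{2}(*b*) bits.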

This analysis can be applied to the recording and transmission of messages other than numbers. Suppose the message is a text in English:

THE_QUICK_BROWN_FOX_JUMPED_OVER_THE_LAZY_DOGS_ _

We can treat this message like a number written in base 27 (letters A-Z and _, the symbol for space). Each character would then be counted as log_{2}(27) ≈ 4.75 bits.
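The base-27 reading can be made concrete with a short Python sketch. The text does not say which digit stands for which character, so the assignment A = 0, ..., Z = 25, _ = 26 below is an assumption, and the helper names `text_to_int` and `int_to_text` are illustrative.

```python
import math

# Assumed digit assignment (not specified in the text): A=0, ..., Z=25, _=26.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ_"

def text_to_int(text: str) -> int:
    """Read `text` as one big number written in base 27."""
    value = 0
    for ch in text:
        value = value * 27 + ALPHABET.index(ch)
    return value

def int_to_text(value: int, length: int) -> str:
    """Invert text_to_int; `length` restores any leading 'A' (zero) digits."""
    digits = []
    for _ in range(length):
        value, d = divmod(value, 27)
        digits.append(ALPHABET[d])
    return "".join(reversed(digits))

message = "THE_QUICK_BROWN_FOX_JUMPED_OVER_THE_LAZY_DOGS__"
n = text_to_int(message)

print(len(message), "characters")
print(f"{len(message) * math.log2(27):.1f} bits at log2(27) bits per character")
print(n.bit_length(), "bits to write the same number in binary")
```

The binary length of the number can never exceed the character count times log_{2}(27), rounded up, which is the sense in which each character "weighs" about 4.75 bits.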

But there is a difference between the English alphabet and the digits 0-26 that occur in base-27 arithmetic. The assumption we made, that in each place all digits are equally likely, is unrealistic for English, where, for example, the letter E occurs more often than any other letter.
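One way to see how far a text is from "all digits equally likely" is to count its characters. The sketch below does this for the sample sentence above; the function name `relative_frequencies` is illustrative. That sentence uses every letter at least once, so it is probably flatter than ordinary English, yet the counts are still visibly uneven.

```python
from collections import Counter

def relative_frequencies(text: str) -> dict:
    """Map each character to its share of the text, most frequent first."""
    counts = Counter(text)
    total = len(text)
    return {ch: n / total for ch, n in counts.most_common()}

sample = "THE_QUICK_BROWN_FOX_JUMPED_OVER_THE_LAZY_DOGS__"

# With 27 equally likely symbols every share would be 1/27.
for ch, freq in relative_frequencies(sample).items():
    print(f"{ch}  {freq:.3f}  (uniform would be {1 / 27:.3f})")
```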

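Morse Code is a way of representing letters by combinations of dots and dashes. Writing 1 for dot, 2 for dash and 0 for space (each letter's code begins with a 0 that separates it from the previous letter, and a lone 0 stands for the word space _), Morse Code looks like this: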

| Char | Code | Char | Code | Char | Code | Char | Code |
|---|---|---|---|---|---|---|---|
| A | 012 | H | 01111 | O | 0222 | V | 01112 |
| B | 02111 | I | 011 | P | 01221 | W | 0122 |
| C | 02121 | J | 01222 | Q | 02212 | X | 02112 |
| D | 0211 | K | 0212 | R | 0121 | Y | 02122 |
| E | 01 | L | 01211 | S | 0111 | Z | 02211 |
| F | 01121 | M | 022 | T | 02 | _ | 0 |
| G | 0221 | N | 021 | U | 0112 | | |

Here is a message and its Morse Code (message from E. C. Cherry's *The Communication of Information*, as quoted in Y. Bar-Hillel, *Language and Information*, Addison-Wesley 1964, p. 222):

| Word | Morse Code |
|---|---|
| WE_ | 0122010 |
| ARE_ | 0120121010 |
| NOT_ | 0210222020 |
| CONCERNED_ | 021210222021021210101210210102110 |
| WITH_ | 012201102011110 |
| THE_ | 0201111010 |
| MEANING_ | 0220101202101102102210 |
| OR_ | 022201210 |
| TRUTH_ | 020121011202011110 |
| OF_ | 0222011210 |
| MESSAGES_ _ | 0220101110111012022101011100 |
| SEMANTICS_ | 011101022012021020110212101110 |
| LIES_ | 012110110101110 |
| OUTSIDE_ | 022201120201110110211010 |
| THE_ | 0201111010 |
| SCOPE_ | 011102121022201221010 |
| OF_ | 0222011210 |
| MATHEMATICAL_ | 0220120201111010220120201102121012012110 |
| INFORMATION_ | 01102101121022201210220120201102220210 |
| THEORY_ _ | 020111101022201210212200 |

The Morse encoding uses a total of 384 base-3 digits to represent 129 characters (including the spaces). The weight in bits is then 384 × log_{2}3 ≈ 608.6, or about 4.72 bits/character. Besides being more practical than a base-27 encoding, Morse Code is slightly cheaper in bits per character. This is because Morse Code takes advantage of the differences in relative frequency of characters in an English text: the most frequent characters, such as E, T and the space, get the shortest codes.
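The digit count and the bits-per-character figure can be recomputed mechanically. The Python sketch below is one way to do it, using the standard International Morse patterns for A to Z and the convention described above (1 for dot, 2 for dash, a leading 0 before each letter, a lone 0 for _). The names `MORSE`, `to_ternary` and `bits_per_character` are illustrative.

```python
import math

# International Morse Code for A-Z, written with '.' for dot and '-' for dash.
MORSE = {
    "A": ".-",   "B": "-...", "C": "-.-.", "D": "-..",  "E": ".",
    "F": "..-.", "G": "--.",  "H": "....", "I": "..",   "J": ".---",
    "K": "-.-",  "L": ".-..", "M": "--",   "N": "-.",   "O": "---",
    "P": ".--.", "Q": "--.-", "R": ".-.",  "S": "...",  "T": "-",
    "U": "..-",  "V": "...-", "W": ".--",  "X": "-..-", "Y": "-.--",
    "Z": "--..",
}

def to_ternary(text: str) -> str:
    """Encode text with 1 = dot, 2 = dash, 0 = separator; '_' is a lone 0."""
    out = []
    for ch in text:
        if ch == "_":
            out.append("0")
        else:
            out.append("0" + MORSE[ch].replace(".", "1").replace("-", "2"))
    return "".join(out)

def bits_per_character(text: str) -> float:
    """Weight of the ternary encoding in bits, divided by the character count."""
    return len(to_ternary(text)) * math.log2(3) / len(text)

message = (
    "WE_ARE_NOT_CONCERNED_WITH_THE_MEANING_OR_TRUTH_OF_MESSAGES__"
    "SEMANTICS_LIES_OUTSIDE_THE_SCOPE_OF_MATHEMATICAL_INFORMATION_THEORY__"
)

print(len(message), "characters")
print(len(to_ternary(message)), "base-3 digits")
print(f"Morse:   {bits_per_character(message):.2f} bits/character")
print(f"base 27: {math.log2(27):.2f} bits/character")
```

Feeding other texts to `bits_per_character` shows how the cost depends on the mix of letters: a text rich in rare letters such as Q or J costs more per character than one made mostly of E and T.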

Is it possible that a better encoding could reduce the cost in bits/character even further?