
The Mathematics of Communication

## 2. Encoding English text

Considering the encoding of numbers in various bases led us to the conclusion that

• If each digit is one of b equally likely possibilities, the number of bits per digit will be log2(b).

This analysis can be applied to the recording and transmission of messages other than numbers. Suppose the message is a text in English:

THE_QUICK_BROWN_FOX_JUMPED_OVER_THE_LAZY_DOGS__

We can treat this message like a number written in base 27 (the letters A-Z and _, the symbol for space). Each character would then be counted as log2(27) ≈ 4.75 bits.
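The rule log2(b) bits per digit can be checked numerically. The following is a minimal Python sketch; the function name `bits_per_digit` is an illustrative choice, not something defined in these notes.

```python
import math


def bits_per_digit(base: int) -> float:
    """Bits carried by one digit when all `base` values are equally likely."""
    return math.log2(base)


# Base 2 and base 10 for comparison, then the 27-symbol alphabet
# (26 letters plus "_" for space).
for base in (2, 10, 26 + 1):
    print(f"base {base:2d}: {bits_per_digit(base):.3f} bits per digit")
```

Multiplying the base-27 figure by the number of characters in a message gives the cost of the whole message under the equal-likelihood assumption.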

But there is a difference between the English alphabet and the digits 0 to 26 that occur in base-27 arithmetic. The assumption we made, that in each place all digits are equally likely, is unrealistic for English, where some letters (E, for example) occur much more often than others.

Morse Code is a way of representing letters by combinations of dots and dashes. Writing 1 for dot, 2 for dash and 0 for space, with each letter's code starting with the 0 that separates it from the previous letter, Morse Code looks like this:

| Letter | Code | Letter | Code | Letter | Code | Letter | Code |
|---|---|---|---|---|---|---|---|
| A | 012 | H | 01111 | O | 0222 | V | 01112 |
| B | 02111 | I | 011 | P | 01221 | W | 0122 |
| C | 02121 | J | 01222 | Q | 02212 | X | 02112 |
| D | 0211 | K | 0212 | R | 0121 | Y | 02122 |
| E | 01 | L | 01211 | S | 0111 | Z | 02211 |
| F | 01121 | M | 022 | T | 02 | _ | 0 |
| G | 0221 | N | 021 | U | 0112 | | |

Here is a message and its Morse Code (message from E. C. Cherry's The Communication of Information, as quoted in Y. Bar-Hillel, Language and Information, Addison-Wesley 1964, p. 222):

| Word | Morse encoding |
|---|---|
| WE_ | 0122010 |
| ARE_ | 0120121010 |
| NOT_ | 0210222020 |
| CONCERNED_ | 021210222021021210101210210102110 |
| WITH_ | 012201102011110 |
| THE_ | 0201111010 |
| MEANING_ | 0220101202101102102210 |
| OR_ | 022201210 |
| TRUTH_ | 020121011202011110 |
| OF_ | 0222011210 |
| MESSAGES__ | 0220101110111012022101011100 |
| SEMANTICS_ | 011101022012021020110212101110 |
| LIES_ | 012110110101110 |
| OUTSIDE_ | 022201120201110110211010 |
| THE_ | 0201111010 |
| SCOPE_ | 011102121022201221010 |
| OF_ | 0222011210 |
| MATHEMATICAL_ | 0220120201111010220120201102121012012110 |
| INFORMATION_ | 01102101121022201210220120201102220210 |
| THEORY__ | 020111101022201210212200 |

The Morse encoding uses a total of 384 base-3 digits to represent 129 characters (including the spaces). The weight in bits is then 384 × log2(3) ≈ 608.6, or 4.72 bits/character, slightly less than the 4.75 bits/character of the base-27 encoding. Besides being more practical than a base-27 encoding, Morse Code is cheaper in bits per character. This is because Morse Code takes advantage of the differences in relative frequency of characters in an English text: frequent letters such as E and T get the shortest codes.
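The digit count and the cost per character can be recomputed mechanically. The sketch below is one way to do it in Python, assuming the standard International Morse patterns for the 26 letters and the 0/1/2 convention used in this section; the names `MORSE`, `encode` and `bits_per_character` are illustrative and do not come from the notes.

```python
import math

# Assumption: standard International Morse patterns ('.' = dot, '-' = dash).
MORSE = {
    "A": ".-",   "B": "-...", "C": "-.-.", "D": "-..",  "E": ".",
    "F": "..-.", "G": "--.",  "H": "....", "I": "..",   "J": ".---",
    "K": "-.-",  "L": ".-..", "M": "--",   "N": "-.",   "O": "---",
    "P": ".--.", "Q": "--.-", "R": ".-.",  "S": "...",  "T": "-",
    "U": "..-",  "V": "...-", "W": ".--",  "X": "-..-", "Y": "-.--",
    "Z": "--..",
}


def encode(text: str) -> str:
    """Encode text as base-3 digits: 0 = space/gap, 1 = dot, 2 = dash."""
    digits = []
    for ch in text:
        if ch == "_":
            digits.append("0")  # the space character is a lone 0
        else:
            pattern = MORSE[ch].replace(".", "1").replace("-", "2")
            digits.append("0" + pattern)  # each letter starts with a 0
    return "".join(digits)


def bits_per_character(text: str) -> float:
    """Cost of the encoding, weighing each base-3 digit at log2(3) bits."""
    return len(encode(text)) * math.log2(3) / len(text)


message = (
    "WE_ARE_NOT_CONCERNED_WITH_THE_MEANING_OR_TRUTH_OF_MESSAGES__"
    "SEMANTICS_LIES_OUTSIDE_THE_SCOPE_OF_MATHEMATICAL_INFORMATION_THEORY__"
)

print(len(message), "characters")
print(len(encode(message)), "base-3 digits")
print(f"{bits_per_character(message):.2f} bits/character (Morse)")
print(f"{math.log2(27):.2f} bits/character (base 27)")
```

Running the same functions on other English text gives a rough feel for how the cost depends on which letters the text happens to use.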

Is it possible that a better encoding could reduce the cost in bits/character even further?