Skip to Main Content
info4



The Mathematics of Communication



4. Constructing an optimal code

Here are Stinson's data modified to take account of _ (space) with probability .166.

A .068 F .018 K .007 P .015 U .023 Z .0008
B .0125 G .016 L .033 Q .0008 V .008 .166
C .023 H .051 M .020 R .05 W .019  
D .043 I .070 N .067 S .063 X .001  
E .106 J .0016 O .0625 T .075 Y .016   

There is an algorithm for constructing a binary code for a set of characters with different probabilities, with cost per character as close as possible to the theoretical minimum - p1log2p1 - p2log2p2 - ... - pmlog2pm. The minimum can only be exactly attained if all the pi are exactly powers of 1/2.

  • List the characters in order of decreasing probability.
  • Put the characters at the ends of the branches of a binary tree in such a way that at each fork the total probability on one side is as close as possible to the total probability on the other.
  • Start at the base of the tree. All characters on the left branch have first code element 0. All characters on the right branch have first element 1.
  • Continue through the higher branchings, each time determining the next digit in the code for all the characters on one side and on the other.

Here is the process illustrated for English, with the probabilities from the table above (optimum=4.13).

               _ 000           _E <         /     E 001  _ETAO <          T 010 <        TAO-----<   A 0110  \                AO<   \                  O 0111    \               I 1000     \           IN<      \         /   N 1001       \ INSHR-<         S 1010        <       SHR-----<   H 10110         \               HR<          \                 R 10111           \             D 11000            \         DL<             \       /   L 11001              \ DLUC<         U 11010               <     UC------<                \             C 11011                 \      M 11100                  \ MWF<   W 111010                   <    WF<                    \      F 111011                     \         Y 111100                      \ YG----<                       <       G 111101                        \ P 111110                         <                           \ B 1111110                           <                            \ V 11111110                             <                              \ K 111111110                               <     J 11111111100                                \ JX<                                  <   X 11111111101                                  \     Q 11111111110                                                                QZ--<                                         Z 11111111111 

The cost for this code is 4.167 bits/character.