Feature Column from the AMS

Digital Revolution (II) - Compression Codes and Technologies

Encoding and decoding

3. Encoding and decoding

Most codes in recent times involve the conversion of the information involved into a representation using the the binary digits 0 and 1, because binary is the basis of most computer calculations. The use of binary codes reflects the central role of computers in technology and commerce.

A simple example of a binary code for English vowels would be:

a = 000

e = 001

i = 010

o = 011

u = 100.

In order to decode information in coded form it is necessary to know where one code word ends and the next code word begins. There are many approaches to dealing with this situation. One solution is to have every code word have the same length, say, k. (In the code above k = 3.) In this case one knows that every k symbols sent are a code word. If one is using codes where each code word is a block of k binary digits, then there are 2^k different code words available to use. This means that if a message uses exactly 16 characters, one can use a block code of length 4 to represent the 16 characters but if the message has 17 characters, then one needs a block code of length 5. Of the 32 possible code words 15 will not be used for anything, which is not desirable. On the other hand, one way of telling where one code word begins and other code word ends is by using a break between the code words. This break symbol (in writing) is usually a space, comma, semicolon, etc. However, when the code words are short this means that a lot of space in the total message is used for the breaks, which is not desirable. For ease of exposition we will refer to the break symbol as a comma. One early idea that coding theorists suggested was the idea of designing a comma-free code. This was a code in which, because the code words had been chosen in a clever way, commas were unnecessary.

One particularly easy type of comma-free code is where no code word has a prefix which is another code word. Given the string 1010, its prefixes are 1, 10, and 101. If the symbols in a code word are being scanned from left to right and the strings 10, 110, and 1110 are code words, then since no prefix of any code word is a code word, we can maintain a dictionary of code words; as a code word arrives we can scan the string from left to right and decode the string to the first string in the dictionary which is a code word. Of course, this means that we are assuming that the string we get has no errors in it! The good news is that mathematics has ways to correct errors in strings during transmission, but a discussion of error correction codes and technologies will have to await a future column in this series. The nifty aspect of prefix-free codes (codes without commas in which no prefix of a code word is a code word) is that they are variable in length and, thus, offer the hope of using fewer symbols on average to send information than would be the case for block codes. It was David Huffman who responded to the challenge of designing such a code which was easy to encode and decode. (Sometimes one can prove abstract mathematical theorems that say that codes with a particular design have nice features, but if one can not also discover how to encode and decode these codes easily--either in theory or with implementable hardware--such codes are only mathematical curiosities.) It is tempting to believe that if one can compress something using a method, then one can repeat this process over and over again getting down to a very compact distillation. It is not difficult to see that when one uses a method where no information is lost, this is not possible. Compression systems are designed to compress particular types of data; when they are used on data of a kind they are not designed for, they can expand rather than compress the representation of the data.