The Caesar Cipher
One of the simplest examples of a substitution cipher is the Caesar cipher, which is said to have been used by Julius Caesar to communicate with his army. Caesar is considered to be one of the first persons to have ever employed encryption for the sake of securing messages. Caesar decided that shifting each letter in the message would be his standard algorithm, and so he informed all of his generals of his decision, and was then able to send them secured messages. Using the Caesar Shift (3 to the right), the message,
"RETURN TO ROME"
would be encrypted as,
"UHWXUQ WR URPH"
In this example, 'R' is shifted to 'U', 'E' is shifted to 'H', and so on. Now, even if the enemy did intercept the message, it would be useless, since only Caesar's generals could read it.
Thus, the Caesar cipher is a shift cipher since the ciphertext alphabet is derived from the plaintext alphabet by shifting each letter a certain number of spaces. For example, if we use a shift of 19, then we get the following pair of ciphertext and plaintext alphabets:
Plaintext: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Ciphertext: T U V W X Y Z A B C D E F G H I J K L M N O P Q R S
To encipher a message, we perform a simple substitution by looking up each of the message's letters in the top row and writing down the corresponding letter from the bottom row. For example, the message
THE FAULT, DEAR BRUTUS, LIES NOT IN OUR STARS BUT IN OURSELVES.
would be enciphered as
MAX YTNEM, WXTK UKNMNL, EBXL GHM BG HNK LMTKL UNM BG HNKLXEOXL.
Essentially, each letter of the alphabet has been shifted nineteen places ahead in the alphabet, wrapping around the end if necessary. Notice that punctuation and blanks are not enciphered but are copied over as themselves.
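The lookup described above can be sketched in a few lines of Python. This is only an illustration, not part of the original notes: the names caesar_encipher and caesar_decipher are made up for this example, and the sketch assumes the message uses only the 26 unaccented English letters.

```python
import string

ALPHABET = string.ascii_uppercase


def caesar_encipher(message: str, shift: int) -> str:
    """Shift each letter `shift` places ahead, wrapping around after Z."""
    result = []
    for ch in message.upper():
        if ch in ALPHABET:
            result.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            # Punctuation and blanks are copied over as themselves.
            result.append(ch)
    return "".join(result)


def caesar_decipher(ciphertext: str, shift: int) -> str:
    """Undo a shift by moving each letter the same number of places back."""
    return caesar_encipher(ciphertext, -shift)


print(caesar_encipher("RETURN TO ROME", 3))
print(caesar_encipher(
    "THE FAULT, DEAR BRUTUS, LIES NOT IN OUR STARS BUT IN OURSELVES.", 19))
```

Deciphering reuses the same routine with a negative shift, which works because the arithmetic is done modulo 26.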
Breaking a Caesar Cipher by Hand
Can a computer guess what shift was used in creating a Caesar cipher? The answer, of course, is yes. But how does it work?
The unknown shift is one of 26 possible shifts. One technique might be to try each of the 26 possible shifts and check which of them results in readable English text. But this approach has limitations. For one thing, how would the computer recognize "readable English text"? For another, what if multiple Caesar shifts were used, as is the case for a Vigenère cipher, where each letter of the keyword provides the basis for a Caesar shift? That is, if the keyword is bam, then every third letter of the plaintext starting at the first would be shifted by 'b' (=1), every third letter beginning at the second would be shifted by 'a' (=0), and every third letter beginning at the third would be shifted by 'm' (=12). No single trial shift will turn such a message into readable English text.
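Both ideas in this paragraph can be tried out with a short sketch. The function names (shift_text, all_decipherments, vigenere_encipher) are hypothetical, and the Vigenère routine assumes that blanks and punctuation do not use up a keyword letter, which the text does not specify.

```python
import string

ALPHABET = string.ascii_uppercase


def shift_text(text: str, shift: int) -> str:
    """Shift every letter by `shift` places (negative shifts go backwards)."""
    return "".join(
        ALPHABET[(ALPHABET.index(ch) + shift) % 26] if ch in ALPHABET else ch
        for ch in text.upper()
    )


def all_decipherments(ciphertext: str) -> dict:
    """Map each candidate shift 0..25 to the text obtained by undoing it."""
    return {shift: shift_text(ciphertext, -shift) for shift in range(26)}


def vigenere_encipher(plaintext: str, keyword: str) -> str:
    """Apply one Caesar shift per letter, cycling through the keyword."""
    shifts = [ALPHABET.index(k) for k in keyword.upper()]
    out = []
    used = 0  # counts letters only (assumption: blanks do not advance the key)
    for ch in plaintext.upper():
        if ch in ALPHABET:
            out.append(shift_text(ch, shifts[used % len(shifts)]))
            used += 1
        else:
            out.append(ch)
    return "".join(out)


# A person can pick out the readable line; a program needs another criterion.
for shift, candidate in all_decipherments("UHWXUQ WR URPH").items():
    print(f"{shift:2d}  {candidate}")

print(vigenere_encipher("ATTACK", "bam"))
```

The loop prints all 26 candidates and leaves the judgment of readability to a human reader, which is exactly the limitation the paragraph points out.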
A better approach makes use of statistical data about English letter frequencies. It is known that in a text of 1000 letters, the letters of the English alphabet occur with about the following frequencies:
Letter:      A   B   C   D   E   F   G   H   I   J   K   L   M
Frequency:  73   9  30  44 130  28  16  35  74   2   3  35  25

Letter:      N   O   P   Q   R   S   T   U   V   W   X   Y   Z
Frequency:  78  74  27   3  77  63  93  27  13  16   5  19   1
This information can be useful in deciding the most likely shift used on a given enciphered message. Suppose the enciphered message is:
K DKVO DYVN LI KX SNSYD, PEVV YP CYEXN KXN PEBI, CSQXSPISXQ XYDRSXQ.
We can tally the frequencies of the letters in this enciphered message, thus:
Letter:      A   B   C   D   E   F   G   H   I   J   K   L   M
Count:       0   1   2   4   3   0   0   0   3   0   4   1   0

Letter:      N   O   P   Q   R   S   T   U   V   W   X   Y   Z
Count:       4   1   4   3   1   6   0   0   4   0   7   5   0
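A tally like this is easy to produce mechanically. The following is a minimal sketch; the helper name letter_counts is an assumption made for this example.

```python
from collections import Counter
import string

CIPHERTEXT = "K DKVO DYVN LI KX SNSYD, PEVV YP CYEXN KXN PEBI, CSQXSPISXQ XYDRSXQ."


def letter_counts(text: str) -> dict:
    """Count how often each of the 26 letters occurs, ignoring everything else."""
    seen = Counter(ch for ch in text.upper() if ch in string.ascii_uppercase)
    # Include letters that never occur so the tally always has 26 entries.
    return {letter: seen[letter] for letter in string.ascii_uppercase}


counts = letter_counts(CIPHERTEXT)
for letter, count in counts.items():
    print(letter, count)
```

Keeping the zero entries matters later on, because the "bare spots" in the tally are as informative as the peaks.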
We can now shift the two tallies against each other so that the large and small frequencies from each distribution roughly match up. For example, if we try a shift of ten on the previous example, we get the following correspondence between English language frequencies and the letter frequencies in the message.
English Language Frequencies
Letter:      A   B   C   D   E   F   G   H   I   J   K   L   M
Frequency:  73   9  30  44 130  28  16  35  74   2   3  35  25

Letter:      N   O   P   Q   R   S   T   U   V   W   X   Y   Z
Frequency:  78  74  27   3  77  63  93  27  13  16   5  19   1
Enciphered Message Frequencies
Letter:      K   L   M   N   O   P   Q   R   S   T   U   V   W
Count:       4   1   0   4   1   4   3   1   6   0   0   4   0

Letter:      X   Y   Z   A   B   C   D   E   F   G   H   I   J
Count:       7   5   0   0   1   2   4   3   0   0   0   3   0
Note that in this case the large frequencies for cipher X and Y correspond to large frequencies for English N and O, and the bare spots for cipher T and U correspond to bare spots for English J and K. Also, an isolated large frequency for cipher S corresponds to a similar one for English I. In view of this evidence we needn't worry too much about the drastic mismatch for English E, which is usually the most frequent letter in a random sample of English text.
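To look at the same correspondence for any trial shift, the two tallies can be lined up by a small helper. This is a sketch only: the name line_up and the printed layout are assumptions, and the English figures are the per-1000 frequencies quoted earlier.

```python
import string

ALPHABET = string.ascii_uppercase

# English frequencies per 1000 letters, A through Z, as quoted in the text.
ENGLISH = dict(zip(ALPHABET, [73, 9, 30, 44, 130, 28, 16, 35, 74, 2, 3, 35, 25,
                              78, 74, 27, 3, 77, 63, 93, 27, 13, 16, 5, 19, 1]))

CIPHERTEXT = "K DKVO DYVN LI KX SNSYD, PEVV YP CYEXN KXN PEBI, CSQXSPISXQ XYDRSXQ."


def line_up(ciphertext: str, shift: int) -> list:
    """Pair each English letter with the cipher letter `shift` places ahead."""
    text = ciphertext.upper()
    rows = []
    for i, plain in enumerate(ALPHABET):
        cipher = ALPHABET[(i + shift) % 26]
        rows.append((plain, ENGLISH[plain], cipher, text.count(cipher)))
    return rows


# Columns: English letter, English frequency, cipher letter, cipher count.
for plain, freq, cipher, count in line_up(CIPHERTEXT, 10):
    print(f"{plain} {freq:4d}   {cipher} {count:2d}")
```

Changing the second argument shows how quickly the peaks and bare spots fall out of step for a wrong shift.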
If we now apply this substitution to the message we get:
A TALE TOLD BY AN IDIOT, FULL OF SOUND AND FURY, SIGNIFYING NOTHING.
Using the Chi-square Statistic
The chi-square statistic allows us to compare how closely a shift of the English frequency distribution matches the frequency distribution of the secret message. Here's an algorithm for computing the chi-square statistic:
Let ef(c) stand for the English frequency of letter c of the alphabet.
Let mf(c) stand for the frequency of letter c in the message.
For each possible shift s between 0 and 25:
    Set ChiSquare(s) to 0.
    For each letter c of the alphabet:
        Add the square of mf((c + s) mod 26), divided by ef(c), to ChiSquare(s).
That is, for a given letter, say 'a', we take the message frequency of the letter that 'a' becomes under one of the possible Caesar shifts, square it, and divide by the English frequency of 'a'. For a given shift, say 5, we do this for each of the 26 letters in the alphabet and add up the results. Doing this for every shift gives 26 different chi-square values, one per shift. The shift s for which the number ChiSquare(s) is smallest is the most likely candidate for the shift that was used to encipher the message.
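The algorithm above translates almost line for line into Python. The sketch below is one possible rendering, not a definitive implementation: the function names are invented for this example, ENGLISH_FREQ holds the per-1000 frequencies quoted earlier, and letters are numbered 0 to 25 for A to Z.

```python
import string

ALPHABET = string.ascii_uppercase

# ef(c): English frequencies per 1000 letters, A through Z, from the text.
ENGLISH_FREQ = [73, 9, 30, 44, 130, 28, 16, 35, 74, 2, 3, 35, 25,
                78, 74, 27, 3, 77, 63, 93, 27, 13, 16, 5, 19, 1]


def message_freq(message: str) -> list:
    """mf(c): how often each letter A..Z occurs in the message."""
    text = message.upper()
    return [text.count(letter) for letter in ALPHABET]


def chi_square(mf: list, shift: int) -> float:
    """Sum over all letters c of mf((c + shift) mod 26) squared over ef(c)."""
    return sum(mf[(c + shift) % 26] ** 2 / ENGLISH_FREQ[c] for c in range(26))


def most_likely_shift(message: str) -> int:
    """Return the shift whose chi-square value is smallest."""
    mf = message_freq(message)
    return min(range(26), key=lambda s: chi_square(mf, s))


def decipher(message: str, shift: int) -> str:
    """Shift every letter back by `shift`; other characters are kept."""
    return "".join(
        ALPHABET[(ALPHABET.index(ch) - shift) % 26] if ch in ALPHABET else ch
        for ch in message.upper()
    )


CIPHERTEXT = "K DKVO DYVN LI KX SNSYD, PEVV YP CYEXN KXN PEBI, CSQXSPISXQ XYDRSXQ."
best = most_likely_shift(CIPHERTEXT)
print(best, decipher(CIPHERTEXT, best))
```

For a fixed message, this sum differs from the usual chi-square statistic, the sum of (observed - expected)^2 / expected, only by a positive scale factor and a constant, so both rank the 26 shifts in the same order. Very short messages give the statistic little to work with, so the smallest value is a likely candidate rather than a guarantee.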