Ben North, September 2016

This write-up assumes some level of background knowledge of the Lorenz cipher machine and the attacks automated by Colossus. The Wikipedia article gives a good description, as does the source material provided by the *General Report on Tunny* (henceforth *GRT*).

The motivation for this study came in two parts:

### Question posed by Tutte

The book *Colossus: The secrets of Bletchley Park's code-breaking computers* has an appendix written by William T. Tutte, '*My Work at Bletchley Park*'. Tutte was the Bletchley Park codebreaker who, in what has been called 'one of the greatest intellectual feats of World War II', reverse-engineered the structure of the Lorenz cipher machine just from a sample of key. In this appendix, he writes (p.368, **emphasis** added):

> Having got a key *K*, we can write
>
> *Δ*(*K*_{1} + *K*_{2}) = *Δ*(*χ*_{1} + *χ*_{2}) + *Δ*(*ψ*_{1} + *ψ*_{2}).
>
> *Δ*(*ψ*_{1} + *ψ*_{2}) is about 70 per cent dot. *['Dot' is the term used for a binary zero.]* Hence *Δ*(*K*_{1} + *K*_{2}) must be in 70 per cent agreement with *Δ*(*χ*_{1} + *χ*_{2}). And the agreement would still be 70 per cent for other pairs of impulses. **Surely with 1000 or so letters of key the rectangle method would have resolved *Δ*(*K*_{1} + *K*_{2}) into *Δχ*_{1} and *Δχ*_{2}?** I suppose I could check this if I went to enough trouble. But what would be the point now, so long after the last Fish message was sent? On second thoughts the problem could be reformulated as one in pure mathematics and so made worth solving. It would not be the first of Bletchley's little problems to undergo such a transformation.

I thought the **emphasised** question was an interesting one.

### Similarity to Belief Propagation for Low-Density Parity Check codes

*GRT* explains the procedures and formulae used in recovering the *χ* wheels from ciphertext, and they struck me as very similar to the Belief Propagation algorithm used for decoding LDPC codes. I thought it would be interesting to compare the two methods more carefully, and also to ask whether Belief Propagation, first proposed in 1982, was an improvement over the methods used at Bletchley Park.

This study combines these points, and looks at chi-breaking from key under both the 'accurate convergence' described in *GRT* and the 'belief propagation' approach. Can we recover *Δχ*_{1} and *Δχ*_{2} from 1000 or so letters of *K*?

We can get a sanity-check sense for whether '1000 or so letters of key' will be sufficient to break χ_{1} and χ_{2} by looking at the information carried in one received bit of *K*_{12}. We have to recover at most 72 bits (χ_{1} is of length 41; χ_{2} is of length 31; 'at most' because there are constraints on the patterns, described more fully below).

The received bit depends on the value of *B*, the 'dot'-biased noise term *Δ*(*ψ*_{1} + *ψ*_{2}), as follows:

- If *B* = 0, we receive the true value of *Δχ*_{12}(*i*, *j*) — this happens 70% of the time; or
- if *B* = 1, we receive the complement of *Δχ*_{12}(*i*, *j*) — this happens 30% of the time.

This is a 'binary symmetric channel', with capacity 1 − *H*_{b}(*p*), where *H*_{b} is the binary entropy function and *p* is the crossover probability, here 30%. In our case, then, the capacity is 1 − *H*_{b}(0·3) = 0·119. Receiving 1000 such values ought to provide 119 bits of information, then, which exceeds the 72 we require. So on the face of it, we do have enough information; but whether our decoding algorithm is good enough to extract it is another question.
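This arithmetic is quick to check; a minimal sketch (the function name is mine):

```python
import math

def binary_entropy(p):
    """Binary entropy H_b(p), in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Capacity of a binary symmetric channel with 30% crossover probability:
capacity = 1.0 - binary_entropy(0.3)  # about 0.119 bits per received letter
```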

The process of 'Rectangling' used at Bletchley Park for recovering the wheel patterns is described in detail in Chapter 24 of GRT, but to summarise:

The unknown *χ*_{1} wheel can be represented as a binary-valued vector of length 41 (indexed from 0 up to 40); similarly, the *χ*_{2} wheel by a 31-long vector. We can generate a '*χ* stream', a time series of length-2 vectors whose first component runs through consecutive entries of *χ*_{1} and whose second runs through consecutive entries of *χ*_{2}, by rotating these wheels (together) one bit-position at a time. We are interested in the streams *Δχ*_{1} and *Δχ*_{2}, where *Δχ*_{k}(*t*) = *χ*_{k}(*t*) + *χ*_{k}(*t* − 1). (The 'delta' suggests a difference, but in the xor-based arithmetic used, addition is the same as subtraction.) We are also interested in their sum, denoted *Δχ*_{12} = *Δχ*_{1} + *Δχ*_{2}, which is cyclic with period 41 × 31 = 1271.
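These streams are straightforward to form; a minimal sketch, assuming wheels are held as Python lists of 0/1 bits (function names are mine):

```python
def delta(wheel):
    """Differenced ('delta') wheel: xor of each bit with its cyclic predecessor."""
    return [wheel[t] ^ wheel[t - 1] for t in range(len(wheel))]

def chi12_stream(dchi1, dchi2, length):
    """The delta-chi-12 stream: the two delta wheels stepped together and xor'd."""
    return [dchi1[t % len(dchi1)] ^ dchi2[t % len(dchi2)] for t in range(length)]
```

Note that any `delta` wheel automatically has an even number of 1s, since each 1 in the original wheel contributes to exactly two xors.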

The model is that there is also a 'biased coin' *B* which produces 0 more often than it produces 1. We flip this coin independently for each *t*, and then observe a sequence of *Δχ*_{1}(*t*) + *Δχ*_{2}(*t*) + *B*(*t*).

The challenge is to recover *Δχ*_{1} and *Δχ*_{2}. This is only possible up to the transformation which inverts all bits of both *Δχ*_{1} and *Δχ*_{2}, because such a transformation would leave the observed *Δχ*_{12} unchanged. We do also have some constraints on the *Δχ* vectors; an immediate one is that such a vector must have an even number of 1s (otherwise it could not be 'integrated' to find *χ*), and see the discussion of 'legal wheels' below.

The 'biased coin' *B* depends on the setting:

- If we are working with intercepted ciphertext, *B* is *ΔD*_{12}, the sum of the first and second impulses of delta-de-*χ* (i.e., delta-*ψ* + delta-plain).
- If, as pondered in the Tutte quote motivating this study, we are working with a section of key *K* recovered from a depth, *B* is *Δψ*_{12}. Hence the equation in the quote:

*Δ*(*K*_{1} + *K*_{2}) = *Δ*(*χ*_{1} + *χ*_{2}) + *Δ*(*ψ*_{1} + *ψ*_{2}).

In full, the information we have is a list of the observed *Δχ*_{1}(*t* % 41) + *Δχ*_{2}(*t* % 31) + *B*(*t*) values for 0 ≤ *t* < *T*, where *T* is the number of letters received.

Since all *B*(*t*) are independent, and because of the cyclic nature of the *Δχ* streams, we can reduce the information to just the count 'number of 0s − number of 1s' observed for each pair of positions *Δχ*_{1}(*i*) and *Δχ*_{2}(*j*). If we receive fewer than 1271 letters, we'll receive no information for some (*i*, *j*) pairs.

The information we have is therefore a 'rectangle' of signed counts, called *θ*(*i*, *j*).
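Forming the rectangle from the received stream can be sketched as follows (the function name is mine; `observed` is assumed to hold the received *Δχ*_{12} + *B* bits in time order):

```python
def make_rectangle(observed):
    """theta[i][j] = (number of 0s) - (number of 1s) received at wheel
    positions i (chi_1, period 41) and j (chi_2, period 31)."""
    theta = [[0] * 31 for _ in range(41)]
    for t, bit in enumerate(observed):
        theta[t % 41][t % 31] += 1 if bit == 0 else -1
    return theta
```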

The following process for estimating the *Δχ*_{2} and *Δχ*_{1} vectors from the *θ*(*i*, *j*) rectangle is described in Section 24C of GRT, and was known as 'crude convergence'.

The idea was, approximately speaking: Hypothesise some vector of 41 values for *Δχ*_{1} (a 'start'). Then, for each *j* < 31, look at all the *θ*(*i*, *j*) for *i* < 41. Because *B* gives 0 more often than it gives 1, a positive *θ*(*i*, *j*) is evidence that *Δχ*_{12}(*i*, *j*) = 0, i.e., that *Δχ*_{2}(*j*) = *Δχ*_{1}(*i*). A negative *θ*(*i*, *j*) is evidence that *Δχ*_{2}(*j*) ≠ *Δχ*_{1}(*i*).

Pretending for a minute that we know all *Δχ*_{1} values to be correct, then, this means:

- *θ*(*i*, *j*) > 0 and *Δχ*_{1}(*i*) = 0 is evidence for *Δχ*_{2}(*j*) = 0;
- *θ*(*i*, *j*) > 0 and *Δχ*_{1}(*i*) = 1 is evidence for *Δχ*_{2}(*j*) = 1;
- *θ*(*i*, *j*) < 0 and *Δχ*_{1}(*i*) = 0 is evidence for *Δχ*_{2}(*j*) = 1;
- *θ*(*i*, *j*) < 0 and *Δχ*_{1}(*i*) = 1 is evidence for *Δχ*_{2}(*j*) = 0.

Let *ε*_{1} be the vector which has the value +1 whenever *Δχ*_{1} = 0 and the value −1 whenever *Δχ*_{1} = 1. Then Σ_{i} [*ε*_{1}(*i*)*θ*(*i*, *j*)] is, in some sense, the total evidence that *Δχ*_{2}(*j*) = 0. If this sum is negative, we should suspect that *Δχ*_{2}(*j*) = 1.

In this way, we determine, on the preponderance of evidence, the most likely value of each *Δχ*_{2}(*j*).

Then, fix that 31-long vector *Δχ*_{2} and run the same process in the other direction to estimate *Δχ*_{1}. Repeat. The idea is that this process will converge to a good estimate of the two *Δχ* vectors.
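One half-sweep of this voting procedure might be sketched as follows (names are mine, and this is my reading of the idea, not GRT's exact recipe):

```python
def crude_half_sweep(theta, dchi1):
    """Given a hypothesised 41-long Delta-chi-1, vote on each Delta-chi-2 bit."""
    eps1 = [1 if b == 0 else -1 for b in dchi1]
    dchi2 = []
    for j in range(31):
        # Positive evidence favours Delta-chi-2(j) = 0.
        evidence = sum(eps1[i] * theta[i][j] for i in range(41))
        dchi2.append(0 if evidence >= 0 else 1)  # ties resolved towards 0
    return dchi2
```

The other half-sweep is the same computation with the roles of the two wheels swapped (and the rectangle transposed).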

The process of 'accurate convergence' handles the calculation in a more formal probabilistic setting, and derives a more accurate procedure. It tracks a continuous score for how likely each bit is to be zero or one, as opposed to just which value is more likely. The details are given in Section 24W(a) of GRT, but in summary:

First define *ζ* as the ratio *P*(*B* = 0) / *P*(*B* = 1), i.e., the factor by which it is more likely that *B* will yield 0 rather than 1. For our case, where '*Δ*(*ψ*_{1} + *ψ*_{2}) is about 70 per cent dot', we have *ζ* ≈ 2·33.

Then define the 'score' *x*_{i} as the log-base-*ζ*-likelihood-ratio of the hypothesis '*Δχ*_{1}(*i*) = 0' (as compared to the alternative '*Δχ*_{1}(*i*) = 1'). Similarly, *y*_{j} is the score of the hypothesis *Δχ*_{2}(*j*) = 0. By alternating between

- treating the *x*_{i} as 'true' and updating the *y*_{j} accordingly; and
- treating the *y*_{j} as 'true' and updating the *x*_{i} accordingly,

we obtain the following update rules:

\[y_j = \sum_i f(\theta_{ij}, x_i);\quad x_i = \sum_j f(\theta_{ij}, y_j)\] where \[f(k, l) = \log_\zeta \left(\frac{\zeta^{k+l} + 1}{\zeta^k + \zeta^l}\right).\]

The process of 'accurate convergence' performs these updates in turn until convergence of the signs of the scores, at which point the two *Δχ* vectors have been estimated.
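The *f* function and one sweep of these updates can be sketched as follows (names are mine; no convergence test or tie-breaking is shown):

```python
import math

ZETA = 0.7 / 0.3  # P(B = 0) / P(B = 1) for a 70%-dot Delta-psi-12

def f(k, l, zeta=ZETA):
    """The GRT combining function, in log-base-zeta likelihood-ratio form."""
    return math.log((zeta ** (k + l) + 1) / (zeta ** k + zeta ** l), zeta)

def accurate_sweep(theta, x):
    """One sweep: update all scores y_j from the x_i, then all x_i from the y_j."""
    y = [sum(f(theta[i][j], x[i]) for i in range(41)) for j in range(31)]
    x = [sum(f(theta[i][j], y[j]) for j in range(31)) for i in range(41)]
    return x, y
```

Note that *f* is symmetric in its arguments, and *f*(0, *l*) = 0: an empty rectangle cell contributes no evidence.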

It occurred to me that the rectangling problem is very similar to the decoding problem for Low-Density Parity Check error-correcting systems, in the following sense.

When rectangling, we have a large collection of bits for which:

- there are parity relations between small subsets of them;
- we receive noisy information on the values of the bits,

and this is exactly the set-up for a LDPC code.

Expressing rectangling in LDPC terms, the transmitted bits are the individual *Δχ*_{1}(*i*) and *Δχ*_{2}(*j*), and also their (xor) sums *Δχ*_{12}(*i*, *j*). We receive no direct information on any *Δχ*_{1}(*i*) or *Δχ*_{2}(*j*) but do have some noisy information on each *Δχ*_{12}(*i*, *j*) — the *θ*(*i*, *j*) counts of excess zeros over ones. We can think of this as a situation where, in LDPC terms, we receive no data bits, and only have noisy information on the parity bits.

One popular decoding method for LDPC codes is Belief Propagation, one of a set of 'Factor Graph' algorithms. The paper *Factor Graphs and the Sum-Product Algorithm*, by Kschischang, Frey, and Loeliger (henceforth *KFL*), gives a clear explanation of this class of algorithms.

In a Factor Graph setting, we view the LDPC decoding problem on a graph whose nodes are of three types: variables, checks, or observations. The graph's edges show which variables are related by parity checks, and which variable appears in which observation.

In our case, the parity checks arise from the definition of *Δχ*_{12}:

*Δχ*_{12}(*i*, *j*) = *Δχ*_{1}(*i*) + *Δχ*_{2}(*j*).

Equivalently,

*Δχ*_{1}(*i*) + *Δχ*_{2}(*j*) + *Δχ*_{12}(*i*, *j*) = 0.

Our variables, then, are the 41 *Δχ*_{1} values, the 31 *Δχ*_{2} values, and the 1271 *Δχ*_{12} values. As a factor graph, we also have 'dongle' nodes for the observations *θ*(*i*, *j*) concerning *Δχ*_{12}(*i*, *j*), and check nodes for the relationship between *Δχ*_{1}(*i*), *Δχ*_{2}(*j*), and *Δχ*_{12}(*i*, *j*). For a simplified 2 × 3 case, the graph is:

Each square '+' node indicates that the variables connected to it must sum to zero.

The 'message passing' algorithm (see the *KFL* paper for details) gives update rules in terms of numerical messages passed between nodes. In our case, we use the log-base-\(\zeta\)-likelihood-ratio representation, and define *x*_{i} to be the value of the *Δχ*_{1}(*i*) node, which is the sum of that node's incoming messages. Similarly, *y*_{j} is the value of the *Δχ*_{2}(*j*) node. These scores mean exactly the same as the *x*_{i} and *y*_{j} scores of the Accurate Convergence scheme.

The computation performed by a variable node for the LLR representation is just to sum the incoming message values (p.512 of *KFL*).

The computation performed by each of our degree-three check nodes is via the function \[CHK(\lambda_1, \lambda_2) = \frac{1 + \lambda_1\lambda_2}{\lambda_1 + \lambda_2},\] where the \(\lambda\) are the likelihood-ratio values (NB *not* the log-likelihood ratios). In terms of log-likelihood-ratio values *k* and *l*, then: \[CHK(k, l) = \log_\zeta\left(\frac{1 + \zeta^k\zeta^l}{\zeta^k + \zeta^l}\right),\] which is identical to the Bletchley Park \(f()\) function for accurate convergence; we will therefore rename it \(f\). This \(f\) function is only ever called with some *θ*(*i*, *j*) as one of its arguments; no messages get sent towards the 'dongle' node containing the observation *θ*(*i*, *j*).

We keep track of *x*_{i}^{(j)}, the most recent message emitted by the variable node for *Δχ*_{1}(*i*) along the edge to the check node that variable shares with *Δχ*_{2}(*j*). The message value *y*_{j}^{(i)} is defined similarly. The messages into and out of a check node are:

The score *x*_{i} is the log-*ζ*-likelihood-ratio of '*Δχ*_{1}(*i*) = 0', and it is updated via the 31 incoming messages, to

\[x_i = \sum_{j=0}^{30} f(\theta_{ij}, y^{(i)}_j).\]

Note that this is almost the same as the Accurate Convergence update rule, except there the summand is *f*(*θ*_{ij}, *y*_{j}) rather than the *f*(*θ*_{ij}, *y*_{j}^{(i)}) we have here.

The message *y*_{j}^{(i)}, in turn, is the sum of all messages which *Δχ*_{2}(*j*) receives over all other edges, i.e., all edges besides the one connecting it to *Δχ*_{12}(*i*, *j*). Equivalently, *y*_{j}^{(i)} is the sum of all incoming messages (which is the score, *y*_{j}) reduced by the message it received along the edge connecting it to *Δχ*_{12}(*i*, *j*):

\[y^{(i)}_j = y_j - f(\theta_{ij}, x^{(j)}_i).\]

For a 'large' number of incoming edges, we might hope that the difference between *y*_{j} and *y*_{j}^{(i)} would be small, in which case Belief Propagation and Accurate Convergence should behave very similarly.
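One full message-passing sweep might be sketched as follows (names are mine; `f` is the check-node combining function passed in as a parameter, and the per-edge messages are stored as full 41 × 31 arrays):

```python
def bp_sweep(theta, x_msg, f):
    """One belief-propagation sweep. x_msg[i][j] is the last message from the
    Delta-chi-1(i) variable node towards the check it shares with Delta-chi-2(j).
    Returns the updated per-edge messages and the node scores x_i, y_j."""
    # Score of each Delta-chi-2(j) node: sum of its incoming messages.
    y = [sum(f(theta[i][j], x_msg[i][j]) for i in range(41)) for j in range(31)]
    # An outgoing message excludes the contribution of the edge it travels along.
    y_msg = [[y[j] - f(theta[i][j], x_msg[i][j]) for i in range(41)]
             for j in range(31)]
    x = [sum(f(theta[i][j], y_msg[j][i]) for j in range(31)) for i in range(41)]
    x_msg = [[x[i] - f(theta[i][j], y_msg[j][i]) for j in range(31)]
             for i in range(41)]
    return x_msg, y_msg, x, y
```

This makes the storage comparison with Accurate Convergence concrete: the messages occupy two 41 × 31 arrays, against just the 41 + 31 scores.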

The actual operational *χ* patterns were significantly constrained. The rules for a valid ('legal') *χ* pattern are described in Section 22B of GRT:

- *χ* must have, as nearly as possible, an equal number of 0s and 1s;
- *Δχ* must be a valid differenced wheel in that it must have an even number of 1s;
- *Δχ* must have, as nearly as possible, an equal number of 0s and 1s;
- *χ* must not have more than four consecutive equal values (equivalently, *Δχ* must not have more than three consecutive 0s).
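A legality test along these lines might be sketched as follows, operating directly on a differenced wheel (the treatment of 'as nearly as possible' via a `tolerance` parameter is my assumption, not GRT's precise statement):

```python
def max_cyclic_run(bits, value):
    """Length of the longest cyclic run of `value` in `bits`."""
    if all(b == value for b in bits):
        return len(bits)
    best = run = 0
    for b in bits + bits:  # doubling handles runs that wrap around
        run = run + 1 if b == value else 0
        best = max(best, run)
    return best

def is_legal_delta(dwheel, tolerance=1):
    """Even number of 1s, roughly balanced, no more than three consecutive 0s."""
    ones = sum(dwheel)
    if ones % 2 != 0:
        return False
    if abs(2 * ones - len(dwheel)) > 2 * tolerance:
        return False
    return max_cyclic_run(dwheel, 0) <= 3
```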

It seems that these extra constraints were important when *χ*-breaking; Section 25D(e) of GRT talks about adjusting candidate wheels to make them legal. Not many details are given, but we can imagine that an approach a little like the following might have been used. No doubt the methods used by the experienced codebreakers were more sophisticated.

Recall that we can only recover the delta-wheels up to the transformation which inverts all bits of both of them — that transformation preserves *Δχ*_{12}, which is all we (noisily) observe. In view of this, we adopt the following steps for finding a pair of legal wheels 'close' to the converged estimate:

If the converged state (or its inverse) consists of two legal *Δχ* wheels, accept it.

Otherwise, use the magnitude of the *x*_{i} scores (for *Δχ*_{1}) and the *y*_{j} scores (for *Δχ*_{2}) to rank each wheel's bits in order from least certain to most certain, and take the least certain twelve for each wheel. Try flipping each single bit in turn; if that gives a legal *Δχ* wheel (or the inversion of one), add it to the list of candidates for that wheel, noting the 'cost' (how much evidence was overruled), and whether the wheel needed to be inverted. Then try flipping two bits at a time, then three. Rank all candidates for each wheel from least costly to most.

Combine a candidate *Δχ*_{1} with a candidate *Δχ*_{2} such that either neither wheel is inverted or both are inverted. Accept the least costly candidate pair of wheels.
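The candidate-generation part of this search might be sketched as follows (names are mine; `is_legal` is assumed to be a predicate on differenced wheels, and the 'cost' is taken to be the total magnitude of the overruled scores):

```python
from itertools import combinations

def legal_candidates(dwheel, scores, is_legal, max_flips=3, n_uncertain=12):
    """Candidates near `dwheel`: flip up to `max_flips` of the `n_uncertain`
    least-certain bits.  Each candidate is (cost, wheel, needed_inversion)."""
    order = sorted(range(len(dwheel)), key=lambda i: abs(scores[i]))[:n_uncertain]
    candidates = []
    for r in range(1, max_flips + 1):
        for flips in combinations(order, r):
            trial = list(dwheel)
            for i in flips:
                trial[i] ^= 1
            cost = sum(abs(scores[i]) for i in flips)
            if is_legal(trial):
                candidates.append((cost, trial, False))
            elif is_legal([1 - b for b in trial]):
                candidates.append((cost, trial, True))
    return sorted(candidates, key=lambda c: c[0])
```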

For the experiments described shortly, we need to be able to randomly generate legal wheels and evaluate the algorithms' performance on them. Section 25X of GRT notes that legal wheels may be enumerated by a process which, for the 41-long *χ*_{1} wheel, boils down to choosing ten integers, each between 1 and 4 inclusive, with sum 20; and another ten such with sum 21. These are the lengths of blocks of 0s and 1s.
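As a sanity check on this enumeration, the bounded compositions involved can be counted with a small dynamic program (the function name is mine):

```python
def count_compositions(total, parts, lo=1, hi=4):
    """Number of ordered sequences of `parts` integers, each in [lo, hi],
    summing to `total`."""
    counts = {0: 1}  # counts[s] = ways to reach partial sum s
    for _ in range(parts):
        nxt = {}
        for s, c in counts.items():
            for v in range(lo, hi + 1):
                nxt[s + v] = nxt.get(s + v, 0) + c
        counts = nxt
    return counts.get(total, 0)
```

The number of choices of block lengths for the 41-long wheel is then `count_compositions(20, 10) * count_compositions(21, 10)`, before accounting for any rotation or starting-value conventions.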

The problem of randomly generating sets of ten such numbers, uniformly over all possibilities, turned out to be an interesting sub-project in the course of this study. I solved it by adapting the algorithm described in the paper Polynomial Time Perfect Sampler for Discretized Dirichlet Distribution (Matsui and Kijima) to incorporate an upper bound on the values each summand may take. I have not formally proved that the modified algorithm obeys all requirements to generate uniform samples, but I think it does, and the behaviour seems reasonable. For present purposes, then, it is certainly acceptable.

To explore these ideas, we can simulate the process of attempting to recover the two *Δχ* patterns as follows:

- Generate a random legal *χ*_{1} wheel and a random legal *χ*_{2} wheel.
- From these, find the two *Δχ* wheels.
- Generate 1000 independent samples of *B* (i.e., *Δψ*_{12}), with *P*(*B* = 0) of 70%.
- Combine to create a sequence of 1000 values of *K*_{12} via *K*_{12} = *Δχ*_{12} + *B*.
- Form into rectangle *θ*(*i*, *j*) by counting excess 0s over 1s for each (*i*, *j*).
- For each algorithm (Accurate Convergence and Belief Propagation):
    - Using a random starting estimate for the wheels' scores, apply the algorithm to find a 'raw' estimate of the two *Δχ* wheels.
    - Count the number of correct bits in the estimates vs the ground truth patterns. Because of the inversion ambiguity, take the 'number correct' evaluation as the greater of the actual number of correct bits, and 72 − (raw number correct).
    - Apply the 'find nearby legal wheel pair' process.
    - Count the number of correct bits in these new patterns. This time we will not have an inverted wheel (because we must have arrived at *Δχ* patterns with an even number of 1s), and so there is no 'also consider 72 − (raw number correct)' step.

This simulation was performed 1,000,000 times, and the proportion of cases achieving at least *N* correct bits, for various *N*, was found. We are hoping for a high proportion of cases achieving a high number of correct bits.

The performance of the two algorithms, before and after legality enforcement, was as follows.

Left graph: Even before error correction using the legality constraint, both methods do well. They perfectly recover the wheel patterns around a third of the time, and 'almost' succeed (getting at most two bits wrong) a further half of the time. The Belief Propagation algorithm does very slightly better, although not meaningfully.

Right graph: After adjusting the recovered patterns so as to find a legal pair of wheel patterns, the 'perfect recovery' rate jumps to around 80% for each algorithm. The 'almost success' rate, allowing up to two wrong bits, jumps to around 95%. Note that we only ever get an even number of bits right, as an odd number of incorrect bits would mean one proposed delta-wheel would have an odd number of 1s, an impossibility.

These experiments have studied the most difficult pair of wheels, i.e., the longest ones. Breaking the shorter *χ* wheels would presumably be easier, with the algorithms achieving a higher success rate.

Although Belief Propagation does appear to perform slightly better than Accurate Convergence, I do not think it would have been useful to the codebreakers at Bletchley Park. The much greater amount of state required to run the algorithm (41 × 31 message values rather than 41 + 31 scores) would make it impracticable to perform manually.

The detailed behaviour of the algorithms could be characterised more fully, but we have learnt enough to answer Tutte's question

> Surely with 1000 or so letters of key the rectangle method would have resolved *Δ*(*K*_{1} + *K*_{2}) into *Δχ*_{1} and *Δχ*_{2}?

with 'very likely, yes'.

Available on github: bennorth/rectangling-ldpc.


This web-page content Copyright 2016 Ben North; licensed under CC BY-SA 4.0