Sequences of Characters in Random Strings 

Given a string of randomly arranged characters consisting of x A's, y B's and z C's, what is the probability that an A will be followed by a B exactly 0 times, 1 time, 2 times, and so on? To answer this question, we first note that the number of distinct ways of arranging the x+y+z characters is {x+y+z, x}{y+z, y){z, z} where the inline notation {m, n} signifies the binomial coefficient (m choose n). Each of these ways is assumed to be equally probable. Now suppose we take k of the A's and k of the B's and put them together into k doublecharacters 'AB'. The number of distinct ways of arranging k of these characters 'AB' and x−k characters 'A' and y−k characters 'B' and z characters 'C' is 



Since we have placed no restriction on the remaining 'A' and 'B' characters, there could be other substrings [AB] besides the k doublecharacters that we built in. Thus, the above number represents the number of arrangements containing at least k substrings “AB”. Moreover, if there are occurrences of ‘AB’ other than the basic k, then the arrangements counted by (1) are not all distinct, because we can exchange the k double characters 'AB' with the accidental substrings ‘AB’. 

To show how this works, consider the simple example with the letters AABBBCC, which gives x = 2, y = 3, z = 2. In this case the total number of distinct (and equiprobable) arrangements is {7,2}{5,3}{2,2} = 210. The number of arrangements with k = 2 (or more) double characters 'AB' is {5,2}{3,1}{2,2} = 30. Clearly there can be no accidental substrings ‘AB’ because we've used up all of our 'A's, so the probability of exactly 2 occurrences of the substring ‘AB’ is 30/210 = 1/7. Now we consider k = 1, i.e., the number of arrangements with 1 or more occurrences of “AB”. In this case the doublecharacter formula gives {6,1}{5,1}{4,2}{2,2} = 180. Since there are 30 arrangements of exactly 2 occurrences, and 180 of 1 or more occurrence, we might be tempted to conclude that there are 150 of exactly 1 occurrence. However, we have to take into account the fact that some of the 180 arrangements are not distinct. Specifically, each of the 30 arrangements of 2 occurrences shows up twice, so there are really just 120 distinct arrangements with exactly 1 occurrence. Thus the probability of 1 occurrence is 120/210 = 4/7. It follows that the probability of 0 occurrences is 2/7. 

The general procedure is to first compute the total number of arrangements given by {x+y+z,x}{y+z,y}{z,z}. Then let m denote the lesser of x and y, and compute the number of arrangements with exactly m occurrences, which is F(m) from equation (1). Then compute the number with exactly m−1 occurrences. This is given by equation (1) with k = m − 1, and then subtracting {m,1}F(m). So, letting N(k) denote the number of arrangements with exactly k occurrences, we have 



Now we want to know the number of distinct arrangements with exactly m−2 occurrences. This is given by 



because F(m−2) is the number of arrangements with at least m−2 occurrences, and then we have to subtract the number of arrangements with exactly m−1 occurrences, each of which was counted {m−1,1} times in F(m−2). Then we have to subtract the number of arrangements with exactly m occurrences, each of which was counted {m,2} times in F(m−2). 

The general formula can be written as 



Interestingly, if we define k = m−j and move the left hand term into the summation we have the nice identity 



To show how this works on a slightly less trivial example, consider the case x = 7, y = 5, z = 11. In this case the total number of arrangements is 1070845776, and we have m = 5 (the lesser of x and y). Using equation (1) we compute 



Then equation (2) gives the results 



Dividing equation (2) by the total number of arrangements, we arrive at a recursive formula involving the probabilities. Given the positive integers x,y,z, set n = x + y + z and m = min(x,y). The probability of exactly t occurrences of the string “AB” can then be computed recursively using the formula 



For example, consider the case x = 7, y = 5, z = 11. Here we have n = 23 and m = 5. The probabilities are computed in descending order of t, beginning with t = m, so we begin by computing 



(Notice that the summation in (1) drops out when computing P(m).) We then compute P(4) using equation (1) as follows 



Next we compute P(3) using equation (1) as follows 



and so on down to P(0). The resulting values are tabulated below. 



These numbers are consistent with the values of N(j) computed previously. 

It turns out that equation (3) can be simplified. Let Q(t) denote the first term on the right side (the big ratio of binomial coefficients). We know that Q(0) = 1 and it can be shown that the ratio Q(t+1)/Q(t) for any t reduces to 



For example, with x = 7, y = 5, z = 11 we immediately have 



so the probabilities can easily be calculated. Moreover, equation (4) implies 



so multiplying the top and bottom by t! we see the above is just the ratio of binomial coefficients 



Therefore the original equation (3) can be written in the form 



or, equivalently, as the nice identity 



where m = min(x,y) and P(t) denotes the probability of exactly t occurrences of the sequence 'AB' in a string of length n containing x 'A' characters and y 'B' characters randomly distributed. (Is there a generating function for these numbers?) 

It's also interesting that if we make the recursive substitutions for P(t+j) in equation (6) we arrive at an explicit formula for P(t). To illustrate this, consider the example x = 7, y = 5, z = 11, which has m = 5 and n = 23. For convenience let q_{j} and p_{j} denote Q(j) and let P(j) respectively. Equation (6) then gives the relations 



Substituting for the values of p_{j} on the right we arrive at the alternating sums of the q_{j}'s as follows 






This leads to the following explicit formula for P(t) 



which lends itself to straightforward computation because each term is of the form {t+j,t}Q(t+j) (with alternating signs). Thus we can use the relation 



to quickly compute each successive term. A simple algorithm for evaluating equation (8) is shown in BASIC below: 



Now suppose we consider the same data as before, except the A's and B's are indistinguishable, so there are really just two characters, which we can call A's and B's, and we wish to know how many times an A is followed immediately by another A. In general we have a string that looks something like 



There are 21 A's and 12 B's and no C's in this string. By the method described previously we know the probability of exactly k occurrences of the sequence “AB” in such a string. The string shown above has k = 4 occurrences of “AB”, which split up the string into 5 substrings as shown below 



This means we have 5 substrings of continuous A's. Since there are a total of 21 A's, it follows that there are 21−5 = 16 occurrences of one A followed by another. So, if f(x,y,z,k) denotes the probability of exactly k occurrences of “AB” in a string of x A's, y B's, and z C's, and if P(x,y,k) denotes the probability of exactly k occurrences of A followed by A in a string of x A's and y B's, then we would have a relation that looks something like 



We say "something like" because the preceding example assumed the final character was an A. If we had appended some B's onto the end of that string we would have had 5 (instead of 4) occurrences of “AB”, but still just 5 continuous strings of A. (Notice that leading B's don't matter.) So the number of continuous strings, given k occurrences of “AB”, is either k or k−1, depending on whether the last character in the string is A or B. If the last character is B then 



So the overall answer should be an appropriately weighted combination of these two cases. To determine the weights we can use conditional probabilities. The A's in the string are split into k contiguous runs if and only if we have 



Since these two events are mutually exclusive, the combined probability of their union is just the sum of the two individual probabilities. Of course, the 2nd clause applies only when k is less than or equal to y, the number of B's in the string. When k equals y+1 the last character cannot be B. 

In any case, the probability of the first clause can be expressed as 



We already know that Pr(k−1 “AB”s) is given by f(x, y, 0, k−1), so we only need to determine the probability that the last character is an A, given that there are exactly k−1 occurrences of “AB”. Since there are k−1 “AB”s, we have 



and we want to know how many such sequences exist, and how many of them end with the character A. At first we might think the number of such sequences is given by the multinomial 



but we have to be careful here, because some of these arrangements will place an individual B immediately behind an individual A, which means the arrangement doesn't have exactly k−1 occurrences of “AB”. To exclude these cases, we need to multiply the above multinomial by the fraction of such strings that have exactly zero occurrences of an individual A followed by an individual B. Fortunately we know how to compute this, because it's an application of our 3character formula. Thus, we need to multiply the above multinomial by f(x−k+1,y−k+1,k−1,0). 

Now to determine the number of these arrangements that end with the letter A we place an A in the last position and then determine the number of arrangements of the remaining components, which consist of 



By the same reasoning as before, we have the basic multinomial 



which must be multiplied by f(x−k, y−k+1, k−1, 0). So the ratio of the number of arrangements ending with A to the total number of arrangements is 



In the same way we can determine the probability that the last character is A, given that there are exactly k (instead of k−1) occurrences of “AB”. Then clearly the probability that the last character is B is just the complement of this. (It's easier to determine the probability of the last character being A because if we set the last character to B directly we have to place a restriction of the final character of the preceding string.) The result is 



Combining all these results, we have shown that if f(x,y,z,k) denotes the probability of exactly k occurrences of the sequence AB in a string of x A's, y B's, and z C's, then the probability of exactly x−k occurrences of A followed immediately by another A in a string of x A's and y B's can then be expressed in the form 



with the understanding that the second term is applied only when the values of x−k−1 and y−k are nonnegative. 
