Given a data file of randomly arranged characters consisting of x A's, y B's and z C's, what is the probability that an A will be followed by a B 0 times, 1 time, 2 times... ? The number of distinct ways of arranging the x+y+z characters is C(x+y+z,x)*C(y+z,y)*C(z,z). Each of these ways is asumed to be equally probable. Now suppose we take k of the A's and k of the B's and put them together into k double-characters 'AB'. The number of distinct ways of arranging k of these characters 'AB' and x-k characters 'A' and y-k characters 'B' and z characters 'C' is F(k) = C(x+y+z-k,k)*C(x+y+z-2k,x-k)*C(y+z-k,y-k)*C(z,z) (1) Since we have placed no restriction on the remaining 'A' and 'B' characters, there could be other substrings [AB] besides the k double-characters that we built in. Thus, the above number represents the number of arrangements containing AT LEAST k substrings [AB]. Moreover, if there are occurrances of [AB] other than the basic k, then the arrangements counted by (1) are not all distinct, because we can exchange the k double characters 'AB' with the accidental substrings [AB]. To show how this works, consider the simple example with the letters AABBBCC, which gives x=2, y=3, z=2. In this case the total number of distinct (and equi-probable) arrangements is C(7,2)*C(5,3)*C(2,2) = 210. The number of arrangements with k=2 (or more) double characters 'AB' is C(5,2)*C(3,1)*C(2,2) = 30. Clearly there can be no accidental substrings [AB] because we've used up all of our 'A's, so the probability of exactly 2 occurrances of the substring [AB] is 30/210 = 1/7. Now we consider k=1, i.e., the number of arrangements with 1 OR MORE occurrances of [AB]. In this case the double-character formula gives C(6,1)*C(5,1)*C(4,2)*C(2,2) = 180. Since there are 30 arrangements of exactly 2 occurrances, and 180 of 1 or more occurrence, we might be tempted to conclude that there are 150 of exactly 1 occurrence. However, we have to take into account the fact that some of the 180 arrangements are not distinct. Specifically, each of the 30 arangements of 2 occurrances shows up twice, so there are really just 120 distinct arrangements with exactly 1 occurrence. Thus the probability of 1 occurrence is 120/210 = 4/7. It follows that the probability of 0 occurrances is 2/7. [End of simple example] The general procedure is to first compute the total number of arrangements given by C(x+y+z,x)*C(y+z,y)*C(z,z). Then let m denote the lesser of x and y, and compute the number of arrangements with exactly m occurrances, which is F(m) from equation (1). Then compute the number with exactly m-1 occurrances. This is given by equation (1) with k=m-1, and then subtracting C(m,1)*F(m). So, letting N(k) denote the number of arrangements with exactly k occurrances, we have N(m) = F(m) N(m-1) = F(m-1) - C(m,1)N(m) Now we want to know the number of distinct arrangements with exactly m-2 occurrances. This is given by N(m-2) = F(m-2) - C(m-1,1)N(m-1) - C(m,2)N(m) because F(m-2) is the number of arrangements with AT LEAST m-2 occurrances, and then we have to subtract the number of arrangements with exactly m-1 occurrances, each of which was counted C(m-1,1) times in F(m-2). Then we have to subtract the number of arrangements with exactly m occurrances, each of which was counted C(m,2) times in F(m-2). The general formula can be written as j N(m-j) = F(m-j) - SUM C(m-j+t,t) N(m-j+t) (2) t=1 Interestingly, if we define k = m-j and move the left hand term into the summation we have the nice identity m-k F(k) = SUM C(k+t,t) N(k+t) t=0 To show how this works on a slightly less trivial example, consider the case x=7, y=5, z=11. In this case we compute the following results Total number of arrangements = 1070845776 m = (the lesser of x and y) = 5 F(5) = 668304 F(4) = 21162960 F(3) = 211629600 F(2) = 888844320 F(1) = 1629547920 F(0) = 1070845776 N(5) = F(5) = 668304 N(4) = F(4) - 5N(5) = 17821440 N(3) = F(3) - 4N(4) - 10N(5) = 133660800 N(2) = F(2) - 3N(3) - 6N(4) - 10N(5) = 374250240 N(1) = F(1) - 2N(2) - 3N(3) - 4N(4) - 5N(5) = 405437760 N(0) = F(0) - N(1) - N(2) - N(3) - N(4) - N(5) = 139007232 ---------- Total = 1070845776 Dividing equation (2) by the total number of arrangements, we arrive at a recursive formula involving the probabilities. Given the positive integers x,y,z, set n=x+y+z and m=min{x,y}. The probability of exactly t occurrances of the string [AB] can then be computed recursively using the formula C(n-t,t) C(n-2t,x-t) C(y+z-t,y-t) m-t P{t} = --------------------------------- - SUM C(t+j,j) P{t+j} C(n,x) C(y+z,y) j=1 (3) For example, consider the case x=7, y=5, z=11. Here we have n=23 and m=5. The probabilities are computed in descending order of t, beginning with t = m, so we begin by computing C(18,5) C(13,2) C(11,0) 3 P{5} = ------------------------- = ------ C(23,7) C(16,5) 4807 (Notice that the summation in (1) drops out when computing P{m}.) We then compute P{4} using equation (1) as follows C(19,4) C(15,3) C(12,1) 80 P{4} = ------------------------- - C(5,1)P{5} = ------ C(23,7) C(16,5) 4087 Next we compute P{3} using equation (1) as follows C(20,3) C(17,4) C(13,2) 600 P{3} = ------------------------- - C(4,1)P{4} - C(5,2)P{5} = ---- C(23,7) C(16,5) 4087 and so on down to P{0}. The resulting values are tabulated below. t P{t} --- --------------------------- 0 624/4807 = 0.12981 1 1820/4807 = 0.37861 2 1680/4807 = 0.34949 3 600/4807 = 0.12482 4 80/4807 = 0.01664 5 3/4807 = 0.00062 These numbers are consistent with the values of N(j) computed previously. It turns out that equation (3) can be simplified. Let Q(t) denote the first term on the right side (the big ratio of binomial coefficients). We know that Q(0) = 1 and it can be shown that the ratio Q(t+1)/Q(t) for any t reduces to Q(t+1) (x-t)(y-t) ------ = ------------ (4) Q(t) (t+1)(n-t) For example, with x=7, y=5, z=11 we immediately have Q(0) = 1 Q(1) = Q(0)*(35/23) = 35/23 Q(2) = Q(1)*(6/11) = 210/253 Q(3) = Q(2)*(5/21) = 50/253 Q(4) = Q(3)*(1/10) = 5/253 Q(5) = Q(4)*(3/95) = 3/4807 so the probabilities can easily be calculated. Moreover, equation (4) implies x! y! (n-t)! Q(t) = ------------------- (5) (x-t)! (y-t)! t! n! so multiplying the top and bottom by t! we see the above is just the ratio of binomial coefficients C(x,t) C(y,t) Q(t) = -------------- C(n,t) Therefore the original equation (3) can be written in the form C(x,t) C(y,t) m-t P{t} = --------------- - SUM C(t+j,j) P{t+j} (6) C(n,t) j=1 or, equivalently, as the nice identity C(x,t) C(y,t) m-t ------------- = SUM C(t+j,j) P{t+j} (7) C(n,t) j=0 where m = min{x,y} and P{t} denotes the probability of exactly t occurrances of the sequence 'AB' in a string of length n containing x 'A' characters and y 'B' characters randomly distributed. (Is there a generating function for these numbers?) It's also interesting to notice that if we make the recursive substitutions for P{t+j} in equation (6) we arrive at an explicit formula for P{t}. To illustrate this, consider the example x=7, y=5, z=11, which has m=5 and n=23. For convenience let qj and pj denote Q(j) and let P{j} respectively. Equation (6) then gives the relations p5 = q5 p4 = q4 - 5p5 p3 = q3 - 4p4 - 10p5 p2 = q2 - 3p3 - 6p4 - 10p5 p1 = q1 - 2p2 - 3p3 - 4p4 - 5p5 p0 = q0 - p1 - p2 - p3 - p4 - p5 Substituting for the values of pj on the right we arrive at the alternating sums of the qj's as follows p5 = q5 p4 = q4 - 5q5 p3 = q3 - 4q4 + 10q5 p2 = q2 - 3q3 + 6q4 - 10q5 p1 = q1 - 2q2 + 3q3 - 4q4 + 5q5 p0 = q0 - q1 + q2 - q3 + q4 - q5 This leads to the following explicit formula for P(t) m-t C(x,t+j) C(y,t+j) P(t) = SUM (-1)^j C(t+j,t) ----------------- (8) j=0 C(n,t+j) which lends itself to straight-forward computation because each term is of the form C(t+j,t) Q(t+j) (with alternating signs). Thus we can use the relation (x-k)(y-k) C(k+1,t) Q(k+1) = ------------ C(k,t) Q(k) (k+1-t)(n-k) to quickly compute each successive term. A simple algorithm for evaluating equation (8) is shown in BASIC below: x=12 : y=19 : z=35 : t = 6 n=x+y+z : q=1 : p=0 : c=1 if y < x then m=y else m=x for i=0 to m if i > =t then p=p+c*q : c=-c*(i+1)/(i+1-t) q=q*(x-i)*(y-i)/((i+1)*(n-i)) next i print p Now suppose we consider the same data as before, except the A's and B's are indistinguishable, so there are really just two characters, which we can call A's and B's, and we wish to know how many times an A is followed immediately by another A. In general we have a string that looks something like BBAAAAABAABBBBAAAAAABAABBBBAAAAAA There are 21 A's and 12 B's and no C's in this string. By the method described previously we know the probability of exactly k occurrances of the sequence {AB} in such a string. The string shown above has k=4 occurrances of {AB}, which split up the string into 5 substrings as shown below BBAAAAA BAA BBBBAAAAAA BAA BBBBAAAAAA This means we have 5 substrings of continuous A's. Since there are a total of 21 A's, it follows that there are 21-5 = 16 occurrances of one A followed by another. So, if f(x,y,z,k) denotes the probability of exactly k occurrances of {AB} in a string of x A's, y B's, and z C's, and if P(x,y,k) denotes the probability of exactly k occurrances of A followed by A in a string of x A's and y B's, then we would have a relation that looks SOMETHING LIKE P(x,y,k) = f(x,y,0,x-k-1) I say "something like" because the preceeding example assumed the final character was an A. If I had stuck some B's onto the end of that string I would have had 5 (instead of 4) occurrances of {AB}, but still just 5 continuous strings of A. (Notice that leading B's don't matter.) So the number of continuous strings, given k occurrances of {AB}, is either k or k-1, depending on whether the last character in the string is A or B. If the last character is B then P(x,y,k) = f(x,y,0,x-k) So the overall answer should be an appropriately weighted combination of these two cases. To determine the weights we can use conditional probabilities. The A's in the string are split into k contiguous runs if and only if we have ((exactly k-1 occurrances of [AB]) AND (last char is A)) OR ((exactly k occurrances of [AB]) AND (last char is B)) Since these two events are mutually exclusive, the combined probability of their union is just the sum of the two individual probabilities. Of course, the 2nd clause applies only when k is less than or equal to y, the number of B's in the string. When k equals y+1 the last character cannot be B. Anyway, the probability of the first clause can be expressed as Pr{(k-1 [AB]'s) AND (last=A)} = Pr{k-1 [AB]'s} Pr{(last=A)|(k-1 [AB]'s)} We already know that Pr{k-1 [AB]'s} is given by f(x,y,0,k-1), so we only need to determine the probability that the last character is an A, given that there are exactly k-1 occurrances of [AB]. Since there are k-1 [AB]'s, we have x-k+1 individual A's y-k+1 individual B's k-1 sequences [AB] -------- x+y-k+1 total and we want to know how many such sequences exist, and how many of them end with the character A. At first we might think the number of such sequences is given by the multinomial (x+y-k+1)! ------------------------ (x-k+1)! (y-k+1)! (k-1)! but we have to be careful here, because some of these arrangements will place an individual B immediately behind an individual A, which means the arrangement doesn't have exactly k-1 occurrances of [AB]. To exclude these cases, we need to multiply the above multinomial by the fraction of such strings that have exactly zero occurrances of an individual A followed by an individual B. Fortunately we know how to compute this, because it's an application of our 3-character formula. Thus, we need to multiply the above multinomial by f(x-k+1,y-k+1,k-1,0). Now to determine the number of these arrangements that end with the letter A we place an A in the last position and then determine the number of arrangements of the remaining components, which consist of x-k individual A's y-k+1 individual B's k-1 sequences [AB] ----- x+y-k total By the same reasoning as before, we have the basic multinomial (x+y-k)! ---------------------- (x-k)! (y-k+1)! (k-1)! which must be multiplied by f(x-k,y-k+1,k-1,0). So the ratio of the number of arrangements ending with A to the total number of arrangements is (x-k+1) f(x-k,y-k+1,k-1,0) Pr{(last=A)|(k-1 [AB]'s)} = ------------------------------ (x+y-k+1) f(x-k+1,y-k+1,k-1,0) In the same way we can determine the probability that the last character is A, given that there are exactly k (instead of k-1) occurrances of [AB]. Then clearly the probability that the last character is B is just the complement of this. (It's easier to determine the probability of the last character being A because if you set the last character to B directly you have to place a restriction of the final character of the preceeding string.) The result is (x-k) f(x-k-1,y-k,k,0) Pr{(last=B)|(k [AB]'s)} = 1 - ---------------------- (x+y-k) f(x-k,y-k,k,0) Combining all these results, we have shown that if f(x,y,z;k) denotes the probability of exactly k occurrances of the sequence AB in a string of x A's, y B's, and z C's, then the probability of exactly x-t occurrances of A followed immediately by another A in a string of x A's and y B's can then be expressed in the form (x-t+1) f(x-t,y-t+1,t-1;0) Pr{x-t} = ------------------------------ f(x,y,0;t-1) (x+y-t+1) f(x-t+1,y-t+1,t-1;0) / (x-t) f(x-t-1,y-t,t;0) \ + ( 1 - ---------------------- ) f(x,y,0;t-1) \ (x+y-t) f(x-t,y-t,t;0) / with the understanding that the second term is applied only when the values of x-t-1 and y-1 are non-negative.

Return to MathPages Main Menu