## Sequences of Characters in Random Strings

```Given a data file of randomly arranged characters consisting of
x A's, y B's and z C's, what is the probability that an A will be
followed by a B  0 times, 1 time, 2 times... ?

The number of distinct ways of arranging the x+y+z characters is
C(x+y+z,x)*C(y+z,y)*C(z,z).  Each of these ways is asumed to be
equally probable.  Now suppose we take k of the A's and k of the B's
and put them together into k double-characters 'AB'.  The number of
distinct ways of arranging k of these characters 'AB' and x-k
characters 'A' and y-k characters 'B' and z characters 'C' is

F(k) =  C(x+y+z-k,k)*C(x+y+z-2k,x-k)*C(y+z-k,y-k)*C(z,z)      (1)

Since we have placed no restriction on the remaining 'A' and 'B'
characters, there could be other substrings [AB] besides the k
double-characters that we built in.  Thus, the above number
represents the number of arrangements containing AT LEAST k
substrings [AB].  Moreover, if there are occurrances of [AB] other
than the basic k, then the arrangements counted by (1) are not all
distinct, because we can exchange the k double characters 'AB' with
the accidental substrings [AB].

To show how this works, consider the simple example with the letters
AABBBCC, which gives x=2, y=3, z=2.  In this case the total number of
distinct (and equi-probable) arrangements is C(7,2)*C(5,3)*C(2,2) =
210.  The number of arrangements with k=2 (or more) double characters
'AB' is C(5,2)*C(3,1)*C(2,2)  =  30.  Clearly there can be no
accidental substrings [AB] because we've used up all of our 'A's, so
the probability of exactly 2 occurrances of the substring [AB] is
30/210 = 1/7.  Now we consider k=1, i.e., the number of arrangements
with 1 OR MORE occurrances of [AB].  In this case the double-character
formula gives C(6,1)*C(5,1)*C(4,2)*C(2,2)  =  180.  Since there are
30 arrangements of exactly 2 occurrances, and 180 of 1 or more
occurrence, we might be tempted to conclude that there are 150 of
exactly 1 occurrence.  However, we have to take into account the fact
that some of the 180 arrangements are not distinct.  Specifically,
each of the 30 arangements of 2 occurrances shows up twice, so there
are really just 120 distinct arrangements with exactly 1 occurrence.
Thus the probability of 1 occurrence is 120/210 = 4/7.  It follows
that the probability of 0 occurrances is 2/7.  [End of simple example]

The general procedure is to first compute the total number of
arrangements given by C(x+y+z,x)*C(y+z,y)*C(z,z).  Then let m denote
the lesser of x and y, and compute the number of arrangements with
exactly m occurrances, which is F(m) from equation (1).  Then compute
the number with exactly m-1 occurrances.  This is given by equation
(1) with k=m-1, and then subtracting C(m,1)*F(m).  So, letting N(k)
denote the number of arrangements with exactly k occurrances, we
have
N(m)   =  F(m)
N(m-1) =  F(m-1) - C(m,1)N(m)

Now we want to know the number of distinct arrangements with exactly
m-2 occurrances.  This is given by

N(m-2) = F(m-2) - C(m-1,1)N(m-1) - C(m,2)N(m)

because F(m-2) is the number of arrangements with AT LEAST m-2
occurrances, and then we have to subtract the number of arrangements
with exactly m-1 occurrances, each of which was counted C(m-1,1)
times in F(m-2).  Then we have to subtract the number of arrangements
with exactly m occurrances, each of which was counted C(m,2) times
in F(m-2).

The general formula can be written as

j
N(m-j)  =  F(m-j)  -  SUM   C(m-j+t,t) N(m-j+t)          (2)
t=1

Interestingly, if we define k = m-j and move the left hand term into
the summation we have the nice identity

m-k
F(k)  =   SUM   C(k+t,t) N(k+t)
t=0

To show how this works on a slightly less trivial example, consider
the case x=7, y=5, z=11.  In this case we compute the following
results

Total number of arrangements = 1070845776
m  =  (the lesser of x and y) =  5

F(5) = 668304
F(4) = 21162960
F(3) = 211629600
F(2) = 888844320
F(1) = 1629547920
F(0) = 1070845776

N(5) = F(5)                                            =     668304
N(4) = F(4) - 5N(5)                                    =   17821440
N(3) = F(3) - 4N(4) - 10N(5)                           =  133660800
N(2) = F(2) - 3N(3) -  6N(4) - 10N(5)                  =  374250240
N(1) = F(1) - 2N(2) -  3N(3) -  4N(4) - 5N(5)          =  405437760
N(0) = F(0) -  N(1) -   N(2) -   N(3) -  N(4) - N(5)   =  139007232
----------
Total = 1070845776

Dividing equation (2) by the total number of arrangements, we arrive
at a recursive formula involving the probabilities.  Given the
positive integers x,y,z, set n=x+y+z and m=min{x,y}.  The probability
of exactly t occurrances of the string [AB] can then be computed
recursively using the formula

C(n-t,t) C(n-2t,x-t) C(y+z-t,y-t)      m-t
P{t} =  ---------------------------------  -  SUM  C(t+j,j) P{t+j}
C(n,x) C(y+z,y)               j=1

(3)

For example, consider the case x=7, y=5, z=11.  Here we have n=23
and m=5.  The probabilities are computed in descending order of t,
beginning with t = m, so we begin by computing

C(18,5) C(13,2) C(11,0)          3
P{5}  =  -------------------------  =    ------
C(23,7) C(16,5)              4807

(Notice that the summation in (1) drops out when computing P{m}.)  We
then compute P{4} using equation (1) as follows

C(19,4) C(15,3) C(12,1)                          80
P{4}  =  -------------------------  -  C(5,1)P{5}    =   ------
C(23,7) C(16,5)                             4087

Next we compute P{3} using equation (1) as follows

C(20,3) C(17,4) C(13,2)                                600
P{3} =  ------------------------- - C(4,1)P{4} - C(5,2)P{5}  =  ----
C(23,7) C(16,5)                                    4087

and so on down to P{0}.  The resulting values are tabulated below.

t              P{t}
---   ---------------------------
0       624/4807    =   0.12981
1      1820/4807    =   0.37861
2      1680/4807    =   0.34949
3       600/4807    =   0.12482
4        80/4807    =   0.01664
5         3/4807    =   0.00062

These numbers are consistent with the values of N(j) computed
previously.

It turns out that equation (3) can be simplified.  Let Q(t) denote
the first term on the right side (the big ratio of binomial
coefficients).  We know that Q(0) = 1 and it can be shown that
the ratio Q(t+1)/Q(t) for any t reduces to

Q(t+1)      (x-t)(y-t)
------  =  ------------                      (4)
Q(t)       (t+1)(n-t)

For example, with x=7, y=5, z=11 we immediately have

Q(0) = 1
Q(1) = Q(0)*(35/23) =    35/23
Q(2) = Q(1)*(6/11)  =  210/253
Q(3) = Q(2)*(5/21)  =   50/253
Q(4) = Q(3)*(1/10)  =    5/253
Q(5) = Q(4)*(3/95)  =   3/4807

so the probabilities can easily be calculated.  Moreover, equation
(4) implies
x! y! (n-t)!
Q(t) =   -------------------                      (5)
(x-t)! (y-t)! t! n!

so multiplying the top and bottom by t! we see the above is just the
ratio of binomial coefficients

C(x,t) C(y,t)
Q(t)  =   --------------
C(n,t)

Therefore the original equation (3) can be written in the form

C(x,t) C(y,t)        m-t
P{t}  =   ---------------  -  SUM  C(t+j,j) P{t+j}        (6)
C(n,t)           j=1

or, equivalently, as the nice identity

C(x,t) C(y,t)        m-t
-------------   =   SUM  C(t+j,j) P{t+j}            (7)
C(n,t)            j=0

where m = min{x,y} and P{t} denotes the probability of exactly t
occurrances of the sequence 'AB' in a string of length n containing
x 'A' characters and y 'B' characters randomly distributed.  (Is
there a generating function for these numbers?)

It's also interesting to notice that if we make the recursive
substitutions for P{t+j} in equation (6) we arrive at an explicit
formula for P{t}.  To illustrate this, consider the example x=7,
y=5, z=11, which has m=5 and n=23.  For convenience let qj and
pj denote Q(j) and let P{j} respectively.  Equation (6) then
gives the relations

p5 = q5
p4 = q4 - 5p5
p3 = q3 - 4p4 - 10p5
p2 = q2 - 3p3 -  6p4 - 10p5
p1 = q1 - 2p2 -  3p3 -  4p4 - 5p5
p0 = q0 -  p1 -   p2 -   p3 -  p4 - p5

Substituting for the values of pj on the right we arrive at the
alternating sums of the qj's as follows

p5 = q5
p4 = q4 - 5q5
p3 = q3 - 4q4 + 10q5
p2 = q2 - 3q3 +  6q4 - 10q5
p1 = q1 - 2q2 +  3q3 -  4q4 + 5q5
p0 = q0 -  q1 +   q2 -   q3 +  q4 -  q5

This leads to the following explicit formula for P(t)

m-t                  C(x,t+j) C(y,t+j)
P(t)  =  SUM  (-1)^j  C(t+j,t) -----------------          (8)
j=0                      C(n,t+j)

which lends itself to straight-forward computation because each
term is of the form C(t+j,t) Q(t+j) (with alternating signs).
Thus we can use the relation

(x-k)(y-k)
C(k+1,t) Q(k+1) = ------------ C(k,t) Q(k)
(k+1-t)(n-k)

to quickly compute each successive term.  A simple algorithm for
evaluating equation (8) is shown in BASIC below:

x=12 : y=19 : z=35 : t = 6
n=x+y+z : q=1 : p=0 : c=1
if y < x then m=y else m=x
for i=0 to m
if i > =t then p=p+c*q : c=-c*(i+1)/(i+1-t)
q=q*(x-i)*(y-i)/((i+1)*(n-i))
next i
print p

Now suppose we consider the same data as before, except the A's and
B's are indistinguishable, so there are really just two characters,
which we can call A's and B's, and we wish to know how many times an
A is followed immediately by another A.  In general we have a string
that looks something like

BBAAAAABAABBBBAAAAAABAABBBBAAAAAA

There are 21 A's and 12 B's and no C's in this string.  By the method
described previously we know the probability of exactly k occurrances
of the sequence {AB} in such a string.  The string shown above has
k=4 occurrances of {AB}, which split up the string into 5 substrings
as shown below

BBAAAAA BAA BBBBAAAAAA BAA BBBBAAAAAA

This means we have 5 substrings of continuous A's.  Since there are
a total of 21 A's, it follows that there are 21-5 = 16 occurrances of
one A followed by another.  So, if f(x,y,z,k) denotes the probability
of exactly k occurrances of {AB} in a string of x A's, y B's, and
z C's, and if P(x,y,k) denotes the probability of exactly k
occurrances of A followed by A in a string of x A's and y B's, then
we would have a relation that looks SOMETHING LIKE

P(x,y,k)  =  f(x,y,0,x-k-1)

I say "something like" because the preceeding example assumed the
final character was an A.  If I had stuck some B's onto the end of
that string I would have had 5 (instead of 4) occurrances of {AB},
but still just 5 continuous strings of A.  (Notice that leading B's
don't matter.)  So the number of continuous strings, given k
occurrances of {AB}, is either k or k-1, depending on whether the
last character in the string is A or B.  If the last character
is B then
P(x,y,k)  =  f(x,y,0,x-k)

So the overall answer should be an appropriately weighted combination
of these two cases.

To determine the weights we can use conditional probabilities.  The
A's in the string are split into k contiguous runs if and only if
we have

((exactly k-1 occurrances of [AB])  AND  (last char is A))

OR

((exactly k occurrances of [AB])  AND  (last char is B))

Since these two events are mutually exclusive, the combined
probability of their union is just the sum of the two individual
probabilities.  Of course, the 2nd clause applies only when k is
less than or equal to y, the number of B's in the string.  When k
equals y+1 the last character cannot be B.

Anyway, the probability of the first clause can be expressed as

Pr{(k-1 [AB]'s) AND (last=A)}

=  Pr{k-1 [AB]'s} Pr{(last=A)|(k-1 [AB]'s)}

We already know that Pr{k-1 [AB]'s} is given by f(x,y,0,k-1), so
we only need to determine the probability that the last character
is an A, given that there are exactly k-1 occurrances of [AB].

Since there are k-1 [AB]'s, we have

x-k+1  individual A's
y-k+1  individual B's
k-1    sequences [AB]
--------
x+y-k+1  total

and we want to know how many such sequences exist, and how many of
them end with the character A.  At first we might think the number
of such sequences is given by the multinomial

(x+y-k+1)!
------------------------
(x-k+1)! (y-k+1)! (k-1)!

but we have to be careful here, because some of these arrangements
will place an individual B immediately behind an individual A,
which means the arrangement doesn't have exactly k-1 occurrances
of [AB].  To exclude these cases, we need to multiply the above
multinomial by the fraction of such strings that have exactly
zero occurrances of an individual A followed by an individual B.
Fortunately we know how to compute this, because it's an application
of our 3-character formula.  Thus, we need to multiply the above
multinomial by f(x-k+1,y-k+1,k-1,0).

Now to determine the number of these arrangements that end with
the letter A we place an A in the last position and then determine
the number of arrangements of the remaining components, which
consist of

x-k     individual A's
y-k+1   individual B's
k-1     sequences [AB]
-----
x+y-k    total

By the same reasoning as before, we have the basic multinomial

(x+y-k)!
----------------------
(x-k)! (y-k+1)! (k-1)!

which must be multiplied by f(x-k,y-k+1,k-1,0).  So the ratio of
the number of arrangements ending with A to the total number of
arrangements is

(x-k+1) f(x-k,y-k+1,k-1,0)
Pr{(last=A)|(k-1 [AB]'s)}  =  ------------------------------
(x+y-k+1) f(x-k+1,y-k+1,k-1,0)

In the same way we can determine the probability that the last
character is A, given that there are exactly k (instead of k-1)
occurrances of [AB].  Then clearly the probability that the last
character is B is just the complement of this.  (It's easier to
determine the probability of the last character being A because
if you set the last character to B directly you have to place
a restriction of the final character of the preceeding string.)
The result is

(x-k) f(x-k-1,y-k,k,0)
Pr{(last=B)|(k [AB]'s)}  = 1 -  ----------------------
(x+y-k) f(x-k,y-k,k,0)

Combining all these results, we have shown that if f(x,y,z;k)
denotes the probability of exactly k occurrances of the sequence
AB in a string of x A's, y B's, and z C's, then the probability of
exactly x-t occurrances of A followed immediately by another A in
a string of x A's and y B's can then be expressed in the form

(x-t+1) f(x-t,y-t+1,t-1;0)
Pr{x-t} = ------------------------------ f(x,y,0;t-1)
(x+y-t+1) f(x-t+1,y-t+1,t-1;0)

/      (x-t) f(x-t-1,y-t,t;0) \
+ ( 1  -  ----------------------  ) f(x,y,0;t-1)
\      (x+y-t) f(x-t,y-t,t;0) /

with the understanding that the second term is applied only when
the values of x-t-1 and y-1 are non-negative.
```