TwoDimensional Runs in Bernoulli Trials 

Suppose we have a tray with a 20x20 grid of "dimples" onto which marbles can be placed. We dip this tray into a large barrel of red and blue marbles, with a population density of a = 0.95 red and b = 0.05 blue marbles, and each dimple acquires a marble. What is the probability of achieving at least one 5x5 square region of all red marbles? 

In principle, given an mbym square grid, we can determine the probability of at least one nbyn regions of “successes” (i.e., red marbles) by direct application of inclusionexclusion. For example, if m = n+1 there would be only four possible nbyn regions, and we could represent these by unions of the nine mutually exclusive regions shown in the figure below. 


Letting the letters denote the event that the respective regions consists of nothing but successes (i.e., red marbles), the event of having one or more nbyn regions of successes has the Boolean representation 

_{} 

If we now let the letters signify the probabilities of the respective events, we can apply inclusionexclusion to give 

_{} 

The probabilities of the individual events are 

_{} 

so the probability P_{1} of one or more nbyn runs in a region of size (n+1) by (n+1) is 
_{} 

In the same way we can determine that the probability P_{2} of one or more nbyn runs in a region of size (n+2) by (n+2) is 

_{} 

Likewise the probability P_{3} of one or more nbyn runs in a region of size (n+3) by (n+3) is 

_{} 

These results suggest that the first several terms of P_{k} are given by 

_{} 

so if a is sufficiently small (say around 0.5 or lower) these terms could be used to approximate the answer. For example, if our original problem had specified a probability of 0.6 for the red marbles (instead of 0.95), we could accurately compute the result 0.00062 using these few terms. Unfortunately for values of a approaching 1, the result doesn't converge unless nearly all of the terms are included. Since the number of terms seems to at least double with each increment of k, it seems there would be well over 100,000 terms in the full expression for P_{15}. 

Of course, one very simple way of placing a lower limit on the answer for large values of a is to consider mutually exclusive regions. We know that the probability of any particular 5x5 region containing nothing but red marbles is (0.95)^{25} = 0.2773..., and this applies to each of the 16 mutually exclusive 5x5 regions. The probability of any k of these events is simply the kth power of the probability of one of them, so the probability of the union is 

_{} 

This represents a lower bound on the probability that any of these mutually exclusive 5x5 regions is all red. 

To improve this lower bound, recall that in the onedimensional case we can express the probability of a run of n consecutive successes (each with probability e) in a sequence of m Bernoulli trials by a simple recurrence relation. Letting p_{j} denote the probability that the sequence of trials from 1 to j contains at least one run of n consecutive successes, we have p_{1} = p_{2} = p_{n}_{1} = 0, and p_{n} = e^{n}. For each subsequent trial the probability p_{j} of having at least one run to that point equals the previous probability p_{j}_{1} plus the probability that [the preceding j1 trials did not contain a run but the entire sequence up to the jth trial does contain a run]. The event in square brackets is true if and only if the last n trials (i.e., the trials j, j1, j2, …, jn+1) were all successes and the trial jn was a failure and the sequence of trials up to jn1 did not contain a run. Thus the probability p_{j} can be expressed as 

_{} 

This formula enables us to recursively compute the probability of one or more runs in a sequence of any given length. The ranges of the sequence involved in this formula are illustrated in the figure below for the case n = 5. 


The recurrence formula asserts that the probability for the entire range up to the blue cell to contain a run equals the probability for the range up to the adjacent yellow cell, plus the probability that the blue and yellow cells are all successes and the pink cell is a failure and the entire green range does not contain a run. 

Using this result we can improve the lower bound on the twodimensional problem by partitioning the original 20x20 grid into four mutually exclusive 20x5 stripes. If Q denotes the probability Q that a stripe contains one or more 5x5 square of pure red, then a lower bound on the overall probability is given by 1  (1  Q)^{4}. To determine Q we can treat a stripe as a onedimensional sequence of Bernoulli trials. Each 5x1 row of a stripe has a probability of e = (0.95)^{5} = 0.7737... of being all red, and so Q is the probability of getting one or more "runs" of length 5 in a sequence of 20 trials. Letting p_{j} denote the probability of one or more runs of 5 in j trials, we have the recursive relation 

_{} 

with the initial values 

_{} 

From this we can compute Q = p_{20} = 0.880786..., and therefore we have the lower bound for the entire 20x20 square of 

_{} 

To verify that these lower bounds are valid, we can estimate the result numerically by means of a Monte Carlo simulation. Simulating a 20x20 array of dimples, each with probability a of being red, we find that the approximate probabilities of at least one 5x5 region of pure red marbles for various values of a are as tabulated below. 

_{} 

These numerical results show that, if each dimple has probability of 0.95 of being filled by a red marble, then the probability of at least one 5x5 region of all red marbles differs from 1 by only about 1.8E6. Also, the probability rises rapidly from near zero to near unity as a increases from about 0.70 to about 0.90. 

One possible approach to computing the exact answer explicitly is to determine a twodimensional recurrence relation analogous to the recurrence for onedimensional runs. In general, given an mbym square array of Bernoulli trials with probability a of success on each individual trial, we seek the probability that the array will contain at least one nbyn square of successes. Letting p_{i,j} denote the probability that a rectangle of i columns and j rows contains at least one such nbyn twodimensional run, we obviously have pi,j = 0 for all i,j such that i < n or j < n, and we have 

_{} 

Incidentally, in the special case n = 1 we have _{} , so in the following we assume n>1. Now, we already know how to compute the values of p_{n,n+j} = p_{n+j,n} for any j by means of the onedimensional recurrence described previously. In particular, we have the values 

_{} 

At this point we have the values depicted in the figure below for the case n = 5. 


We next need to determine the value of p_{n+1,n+1}. By reasoning analogous to the onedimensional case, we can say that this equals the probability of one or more runs in the square region extending to the cell [n+1,n+1] but excluding that cell, plus the probability that a run is completed by that cell but there were no other runs in that square region. It might seem as if the first quantity would be given by p_{n+1,n} + p_{n,n+1} – p_{n,n}, which would give 

_{} 

However, this doesn’t fully account for all the interdependencies of overlapping runs. To determine the correct expression, note that there are only three possible positions for an nbyn run in the region excluding the cell [n+1,n+1], and the probability of the union is given by inclusionexclusion as 

_{} 

To this we must add the probability that a run is completed by the cell [n+1,n+1] and there are no other runs in the overall region. This condition entails that all n^{2} cells leading up to the subject corner are successes, and there is at least one failure in each of the three overlapping regions marked with blue, green, and red bars in the figure below. 


We can partition those cells into five groups, designated a through e as shown below. 


The Boolean expression for the required failures is 

_{} 

Letting the letters denote the probabilities of the individual components, the probability of the overall event is given by inclusionexclusion as 

_{} 

In each region we require that at least one cell is a failure, i.e., not all the cells are successes, so the individual component probabilities are 

_{} 

Inserting these values into the preceding expression, we get 

_{} 

So, multiplying this by a to the power n^{2} and adding the result to expression (1), we arrive at the probability 

_{} 

Naturally this is identical to the expression for P_{1} derived previously. The analogy with the onedimensional recurrence is clarified if we write this as follows 

_{} 

where P* is the probability of one or more 5x5 red regions in overall square region less the 5x5 square at the outer corner. For this particular case, P* = 0, but in general it is not zero. To calculate the probabilities of all the cells in the entire 20x20 grid, we need to generalize this formula to accommodate the general case illustrated below. 


Given the probabilities for the rectangular regions extending from the outer boundary to each of the green, pink, and yellow cells, we wish to compute the probability for the blue cell. Just as in the onedimensional case, this will equal the probability for the region extending from the boundary up to the adjacent yellow cells, plus the probability that the blue and yellow cells are all successes and the pink cells are not contained in any runs and there are no runs in the green cells. The task therefore reduces to determining a general expression for the probability that the union of two overlapping rectangular regions contains a run, given the probabilities of containing a run for each of the rectangular regions within that union. 

Another approach would be to regard the m^{2} trials as a linear sequence (reading left to right and top to bottom), and treat the squares of successes as linear patterns, e.g., with m=5,n=2 the squares are just linear sequences of the form 1100011 (where 0 means "don't care"). However, some way of excluding wraparound" squares would be needed, and there are complications in the recurrence relation for runs containing “don’t care” elements. 
