Importance of Being Urnest (B Balls, K Colors, N Urns)

Given N urns and B balls of each of K colors, randomly distribute 
the KB balls in the N urns such that no two balls of identical color 
are placed in the same urn.  What is the probability that at least 
one ball of each color will be in an urn by itself?

We know that the total number of distinct ways we can distribute the 
balls in the urns is C(N,B)^K where C(n,m) is the binomial coefficient
n!/(m!(n-m)!).  Each of these arrangements is equally likely, so "all" 
we need to do is determine how many of these arrangements satisfy the 
special condition that at least one ball of each color is in an urn 
by itself.  If we denote this number by Q(N,B,K) then we clearly have

                 Q(N,B,1) = C(N,B) 

                 Q(N,B,2) = C(N,B)^2 - C(N,B)

for all N,B with N >= B.  However, for more than two colors it gets
complicated.  One approach might be to determine a generating 
function, but a more simple-minded method of determining the number 
of arrangements of K distinct sets of B balls into N urns such that 
each color is in at least one urn by itself is to enumerate the 
partitions of K(B-1) into N-K or fewer parts no greater than K, 
then append K 1's, and then evaluate the combinations of colors.

To illustrate, let's compute the probability for 9 urns with 4 balls 
in each of 3 colors.  The partitions of 9 into 6 or fewer parts no 
greater than 3, augmented with three 1's, are shown below along with 
their repsective enumerations.  (To save some space, let {mn} denote 
the binomial coefficient C(m,n) and let [mn\jk] = ((m!n!)/(j!k!)).)

3 3 3        1 1 1  {96}{63} [3]                           10080
3 3 2 1      1 1 1  {92}{71}{64} {32}[4\2]                136080
3 3 1 1 1    1 1 1  {92}{76} [6\222]                       22680
3 2 2 2      1 1 1  {91}{83}{53} [33]                     181440
3 2 2 1 1    1 1 1  {91}{82}{65} (3[5\3]+6[5\22])         362880
3 2 1 1 1 1  1 1 1  {91}{81}{77} 3[7\322]                  45360
2 2 2 2 1    1 1 1  {94}{54} 3[44\22]                     272160
2 2 2 1 1 1  1 1 1  {93}{66} ([36\222]+18[6\32]+3[6\4])   143640
                                                         -------
                                          Q(9,4,3)   =   1174320

Thus the probability of distributing 12 balls, 4 in each of 3 colors,
into 9 urns such that each color is by itself in at least one urn
is 1174320/2000376 = 2330/3969 = 0.587049635...

In principle this method can be used to compute any Q(N,B,K), but it 
becomes increasingly laborious as the values of N, B, and K increase.  
I'd be interested to know if there is a generating function that 
would simplify the computation.

A related, but more tractable, problem is to compute the _expected 
number_ of urns containing exactly 1 red ball, and exactly 1 green 
ball, etc.  This can be computed fairly easily by a recursive method
because at each point you only need to keep track of the number of 
empty urns, the number of urns with more than one ball, the number of
urns with exactly 1 red, the number with exactly 1 green, and so on.  
The expected number of urns containing exactly one of a given color
is the same for each color, and is equal to 

                       B (N-B)^(K-1)
                       -------------
                          N^(K-1)

Suppose we are interested in just approximate results for parameters
in the range of N = 100 to 1000, K = 8 to 32, and we want to determine
the optimum value of B.  For parameters like these it's possible to 
come up with a reasonable estimate without too much trouble.  Suppose 
N=100, K=8, and B=10.  Begin by distributing the 10 white balls.  Now 
what is the probability that after distributing the other seven colors
we will have "covered" every one of the 10 white balls?  This depends 
on what percentage of the urns will get hit by one of the remaining 
seven colors.

We know that each color covers 10% of the urns (in this case), and 
that the coverages are mutually independent, so (in the limit for 
large N) we can just take the union, which is 1 - (1 - 0.1)^7.  This
means that about 52 of the urns will be covered by these seven colors
colors.  (We don't cover 70 urns with these 70 balls because they 
overlap.)

The total number of arrangements of the original 10 balls and the 52
"covered" urns is C(100,10)*C(100,52).  On the other hand, the number 
of these arrangements for which all 10 white balls are in a covered 
urn is just C(52,10)*C(100,52), so the probability that all 10 of the 
white balls are covered is just  C(52,10)/C(100,10), which is about 
0.000913.

Now, the situation is clearly symmetrical in the eight colors, so 
each one has probability of 0.000913 of being covered.  Thus the 
probability that one or more of the colors gets completely covered 
is just the union of these events.  Of course, they are not strictly 
independent, but for parameters in the range of interest they are 
essentially independent.  Therefore the probability of failure (i.e., 
one or more colors completely covered) is 1 - (1 - 0.000913)^8,  
which equals 0.00728.

Needless to say, there are several approximations buried in this 
procedure, and their validity would have to be evaluated for each 
set of parameters.  But I'd say for the range of parameters mentioned
above this should give pretty good results.  To formulate it 
explicitly, note that the "52" above was really rounded off, and 
the fractional part was thrown away.  This was done to give a nice 
integer argument for the binomial function C(m,n).  However, to avoid 
that roundoff, let's use the gamma function G to give our factorials, 
so we can define g(x,y) = G(x+1)/(G(y+1)G(x-y+1)).

What I computed above was the probability of failure, whereas 
P(N,B,K) was defined as the probability of success, i.e., every 
color has at least one urn to itself.  So the above result is 
P(100,10,8) = 0.99272.  With this convention the general (and 
definitely approximate) formula for P(N,B,K) is

                /      g( N(1 - (1 - B/N)^(K-1) , B ) \ K
  P(N,B,K)  =  ( 1  -  ------------------------------  )
                \                 g(N,B)              /

Actually, the overall exponent "K" should really be K-1 when K is 
very small (like 2), because the degrees of freedom require the two 
colors to cover each other.  However, for K greater than about 4 the 
independence is adequate to justify the full exponent K.

Anyway, this formula gives the following results for N=100 and K=8:

       B    1-P(100,B,8)            B    1-P(100,B,8)
      ---   -----------            ---   -----------
       0      1.0000                13      0.00992
       1      0.4303                14      0.01154
       2      0.1227                15      0.01372
       3      0.0485                16      0.01659
       4      0.02490               17      0.02033
       5      0.01549               18      0.02515
       6      0.01114               19      0.03133
       7      0.00898               20      0.03921
       8      0.00792               25      0.1198
       9      0.00751 <--min        30      0.3153
      10      0.00755               35      0.6227
      11      0.00797               40      0.8830
      12      0.00875               45      0.9834

Notice that the above function is a very nice "bathtub curve".



Return to MathPages Main Menu