The Collector's Problem

Suppose there exist n (maybe 100) baseball players, each of whom 
has his face on baseball cards.  A fan can buy cards, one at a 
time.  Each purchase has an equal chance of being any of the 
players (no matter how many cards are bought).  How many cards 
must the fan buy in order to have a complete set with probability 
p (say, 0.99)?

This is related to the well-known problem of determining the 
*expected* number of purchases necessary to acquire a complete set 
of cards, i.e., at least one of each of the n available types.  The 
expected number of purchases is

         E(n)  =  n(1 + 1/2 + 1/3 + ... + 1/n)

as discussed in the note Collecting k Complete Sets of Coupons.
However, it's somewhat more difficult to determine the number of 
purchases necessary to give a certain probability of having a 
complete set.  In a sense, this is a variation of the generalized 
"birthday problem", with 100 possible baseball cards instead of 
365 possible birthdays.  In general terms, if there are L distinct 
possible outcomes (each assumed to be equally probable), the problem 
is to determine the probability of a particular set of outcomes in 
N trials.

If N is small, then the "brute force" method can be used to compute 
the exact answer.  To take a simple example, suppose N=5.  The seven 
possible partitions of 5 are

                     5  = [5]
                        = [4+1]
                        = [3+2]
                        = [3+1+1]
                        = [2+2+1]
                        = [2+1+1+1]
                        = [1+1+1+1+1]

These partitions correspond to all possible outcomes.  The first
represents the cases when all 5 trials have the same outcome.  The
second represents the case when four of the five outcomes were the 
same and one was different.  The third represents the case when three 
of the trials had one outcome and the other two trials had another 
outcome, and so on.

Each of these seven outcomes has a definite probability.  For example,
the probability of outcome [5] is simply (1/L)^4, and the probability
of outcome [1+1+1+1+1] is given by (L-1)!/((L-5!)*(L)^4).  Let s(N,k) 
denote the number of components in the kth partition.  The kth partition 
of N will consist of a certain number of 1s  (i.e., unique outcomes), a 
certain number of 2s (i.e., "doubles"), a certain number of 3s (i.e., 
"triples"), and so on.  Let  a_t  denote the number of t-tuples in the 
kth partition.  Then define function

       M(N,k)   =    ---------------------------
                     PROD  ((t!)^(a_t))*((a_t)!)

This is called the M_3 multinomial coefficient in "The Handbook of 
Mathematical Functions" by Abramowitz and Stegun (chapter 24 on 
Combinatorial Analysis).  Then I believe the exact probability of 
Q[N,k] (the kth partition of N) is given by

                              M(N,k) L!
       Pr{Q[N,k]} =    -----------------------
                           (L-s(N,k))! L^N

Once we have computed the probability of each partition, we can then
add up the probabilities for the partitions having any particular
property of interest, such as those for which all possible outcomes
occurred at least once during the N trials (e.g., you have at least
one of the 100 available basaeball cards).  

This approach is easy if N is small.  The difficulty is that the number 
of possible partitions of N (of L or fewer components) increases very 
rapidly as N gets larger.  For example, p(180) = 684,957,390,936.
Therefore, computing the EXACT rational answers for large values of N 
may be computationally intractable, although you could probably derive 
some good asymptotic formulas.  Feller's "Introduction to Probability" 
discusses approximate solutions of this general problem.

Notice that the solution of the "standard" birthday problem (to find
the probability of NO duplicate results) is a special case of the above
formula.  In this case we want the probability of the single partition 
consisting of all 1s.  Here we have M(N,k)=1 and s(N,k)=N, so the answer 
to the standard birthday problem reduces to the well-known formula
Pr{no duplicates} = L!/[(L-N)!*L^N].

To answer the more specialized question of the probability of at least
one of each possible outcome, a Markov model is convenient.  The states
can be distinguished by a single number, the number of outcomes that
have occurred so far.  The 100x100 transition matrix is
      0.01     0      0      0      0   . . .    0      0     0
      0.99   0.02     0      0      0   . . .    0      0     0
        0    0.98   0.03     0      0   . . .    0      0     0
        0      0    0.97   0.04     0   . . .    0      0     0
        0      0      0    0.96   0.05  . . .    0      0     0
        .      .      .      .      .            .      .     .
        .      .      .      .      .            .      .     .
        .      .      .      .      .            .      .     .
        0      0      0      0      0          0.98     0     0
        0      0      0      0      0          0.02   0.99    0
        0      0      0      0      0            0    0.01   1.00

The probability of having seen all possible outcomes after n trials
is the [100,1] component of M^n.  For n=2^k I get the following
                   n           Pr

                    1        0.000000
                    2        0.000000
                    4        0.000000
                    8        0.000000
                   16        0.000000
                   32        0.000000
                   64        0.000000
                  128        0.000000
                  256        0.000137
                  512        0.556018
                 1024        0.996647
                 2048        0.999999
                 4096        1.000000
                 8192        1.000000

With this method you can determine the precise value of n for which the
probability first exceeds any specific value.

Return to MathPages Main Menu