Collecting k Complete Sets of Coupons

If there are n distinct kinds of baseball cards (or coupons, or
whatever), each equally likely to be received with any given purchase, 
what is the expected number of purchases in order to acquire k 
complete sets, i.e., at least k of each of the n types of cards?  

With k=1 this is just the standard "collector's problem", which has 
a simple and well-known answer, namely, that the expected number of 
purchases is
                 n(1 + 1/2 + 1/3 + ... + 1/n)

This is analagous to a problem in reliability theory with n parallel 
redundant components, each with an exponential failure rate of 1/T, so
the mean time to fail of each individual component is T.  Beginning 
from a full-up system, the mean (expected) time until ALL n components 
have failed is 

                 T(1 + 1/2 + 1/3 + ... + 1/n)

For failures of n redundant systems the proof of this expected mean 
time to failure can be seen immediately from the Markov model where L 
is the individual (exponential) failure rate of each component:
   ______          _______           ______               ______
  | 0    |   nL   |   1   | (n-1)L  |   2  |          L  |   n  |
  |failed|------->|failed |-------->|failed|--- ... -----|failed|
  |______|        |_______|         |______|             |______|
 
The mean times for these transitions are 1/nL, 1/((n-1)L),..,1/L,
which gives the sum (1/L)(1 + 1/2 + 1/3 + ... + 1/n).

Intuitively this is related to the collector's problem (with k=1)
because we can regard each of the n types of coupons as a "component", 
and getting a coupon of a given type is the same as "failing the 
component".  Since there is a 1/n chance of getting a coupon of that 
type on each purchase, the expected number of purchases to get one of 
that type is approximately n.  Thus, the "mean time to fail" for that 
"component" is roughly T=n, where we have made "time" proportional to
"purchases", and so the expected (mean) "time to fail" (i.e., number 
of purchases) to get ONE of EACH of the n types is just the quantity
n(1 + 1/2 + ... + 1/n) as noted above.  

(Incidentally, if there were infinitely many different types of 
coupons, we could obviously never get a complete set, so this gives 
a proof that the harmonic series diverges.)

However, the simple analogy between the reliability model and the
collector's problem breaks down for k greater than 1.  For this more
general case we need to evaluate the problem combinatorially.  We
might naively think, at least as a first approximation, that the
expected number of purchases needed to acquire k complete sets might 
be roughly k times the expected number of purchases needed to acquire 
just 1 complete set, but this turns out not to be a very good 
approximation.

If we focus first on just the simple case n=2, the direct combinatorial 
approach gives the expected number of purchases needed to get at 
least k of both types as

                inf
       E(k) =  SUM   (j+1) C(j,k-1) / 2^j            (1)
               j=2k-1

which gives E(1)=3, E(2)=11/2, E(3)=63/8, and so on.  This solves the
general problem for n=2, but unfortunately or n greater than 2 this 
approach requires us to evauate a summation over the set of partitions 
into n parts, which becomes fairly complicated as n increases.

For example, with n=3 and k=2, there are 3^6 possible (and equally
likely) sequences of coupons, but only 21 distinct partitions of 6
into three or fewer positive integers.  Of those partitions, only
[2,2,2] satisfies the requirement k=2.  The number of sequences that
give this partition is C(6,2)C(4,2)C(2,2)=90, so the contribution to
the expectation for 6 purchases is 6*(90/3^6).  Then we can proceed
to compute the probability of at least 2 of each type after 7
purchases, and subtract the probability after 6 to give the prob
that exactly 7 purchases are required, etc.  As n increases we will 
need to consider the partitions into more and more parts, and that 
is generally non-trivial.

Equation (1) gives the values

    k       E_2(k)       numerical   1st diff   2nd diff
 -------  -----------    ---------   --------   --------
     1         3/ 2^0     3.0000
     2        11/ 2^1     5.5000      2.5000
     3        63/ 2^3     7.8750      2.3750     -0.1250
     4       163/ 2^4    10.1875      2.3125     -0.0625
     5      1595/ 2^7    12.4609      2.2734     -0.0391
     6      3765/ 2^8    14.7070      2.2461     -0.0273
     7     17339/2^10    16.9326      2.2256     -0.0205
     8     39203/2^11    19.1420      2.2094     -0.0162
     9    699219/2^15    21.3384      2.1964     -0.0130
    10   1541665/2^16    23.5239      2.1855     -0.0109
    11   6737137/2^18    25.7001      2.1762     -0.0093
    12  14611029/2^19    27.8683      2.1682     -0.0080

It appears that the expected number of purchases to get k complete
sets of 2 coupons goes up by about 2 each time k is incremented,
so it ends up being just a little greater than 2k, whereas our 
earlier naive estimate (k times the expectation for one complete 
set) predicted 3k.  

It's interesting to note the exponents of 2 in the denominators of 
these expectations are

                0,1,3,4,7,8,10,11,15,16,18,19,...

and the differences between consecutive exponents are

                   1,2,1,3,1,2,1,4,1,2,1,...

which of course are the powers of 2 in the sequence of consecutive
even numbers 2,4,6,8,10,12,14,16,18,20,22,...  Notice that we can 
write the above values in the form

         E_2(k) = k(sum)

and when we do this we see a familiar pattern:

      E_2(1)  =  1(3)
      E_2(2)  =  2(3 - 1/4)
      E_2(3)  =  3(3 - 1/4 - 1/8)
      E_2(4)  =  4(3 - 1/4 - 1/8 - 5/64)
      E_2(5)  =  5(3 - 1/4 - 1/8 - 5/64 - 7/128)
                etc.

Thus each term is simply the coefficient in the expansion of
2 sqrt(1-x), which is

    _______
 2 / 1 - x   =   2  - x - (1/4)x^2 - (1/8)x^3 - (5/64)x^4 - ...


It's interesting to compare this with the well-known case k=1, 
for which we have

      E_1(1)  =  1(1)
      E_2(1)  =  2(1 + 1/2)
      E_3(1)  =  3(1 + 1/2 + 1/3)
      E_4(1)  =  4(1 + 1/2 + 1/3 + 1/4)
      E_5(1)  =  5(1 + 1/2 + 1/3 + 1/4 + 1/5)

where the terms are the coefficients of the expansion -ln(1+x)

   -ln(1+x)  =  (1)x + (1/2)x^2 + (1/3)x^3 + (1/4)x^4 + ...

To summarize, if there are n different kinds of coupons, let E_n(k) 
denote the expected number of purchases necessary to acquire k complete 
sets.  The well-known case is with k=1, in which case we have

         E_n(1) = n(1 + 1/2 + 1/3 + ... + 1/n)

for all n. Of course, the case of n=1 trivially gives E_1(k) = k
for all k.  We've also seen that for n=2 the values of E_2(k) 
can be given as a sum of k terms whose values (after the first)
are the leading coefficients in the power series expansion of
2sqrt(1-x).  Thus we have

              / 3     1     1*3     1*3*5          1*3*..*(2k-3) \
 E_2(k) = 2*k( --- - --- - ----- - ------- - ... - -------------  )
              \ 2    2*4   2*4*6   2*4*6*8          2*4*..*(2k)  /

for all k.  

Anyway, generalizing the combinatorial approach to give the expected 
number of purchases necessary to acquire k full sets of n coupons, we 
first determine the probability of each way of accomplishing this in a 
given number of purchases.  Clearly the probability of accomplishing 
this is zero for any number less than kn.  The probability of having 
k complete sets of n coupons after exactly kn purchases is

              M(nk;k,k,...,k) 
     q0   =   ---------------
                  n^(nk)

where M(a;b,c,..,z) signifies the multinomial coefficient, e.g.,
M(6;3,2,1) = (6!)/((3!)(2!)(1!)).  This represents the case of no 
excess coupons, so it can be regarded as a partition of 0 into n 
parts, added on top of a set of n consecutive k's: [k k k ... k].  
If we let p[j] denote the probability of achieving k complete sets 
in *exactly* j purchases, then we have p[j]=0 for all j less than 
kn, and p[kn] = q0.

The next case corresponds to the possible results after kn+1 
purchases, so there is exactly one excess coupon, and this can 
be regarded as a partition of 1 into n parts, so it looks like 
[k+1 k k ... k], where order is suppressed.  Of course, there 
are n distinct ways of arranging this partition, so the total
probability of having (at least) k complete sets of n coupons 
after kn+1 purchases is

               M(nk+1;k+1,k,k,..,k) 
    q1  =    n --------------------
                   n^(nk+1)

It follows that the probability of achieving k sets in *exactly*
kn+1 purchases is  p[kn+1] = q1 - q0.

Now, going on to cover the case of kn+2 purchases, there are
two possible ways of partitioning the 2 excess coupons.  We
have either [k+2 k k ... k]  or else  [k+1 k+1 k ... k].  In the
first case there are n distinct arrangements, whereas in the 
second case there are n(n-1)/2 distinct arrangements.  Thus 
the probability of having k complete sets after kn+2 purchases 
is q2, defined by

                
 n^(nk+2) q2  =  C(n,1) M(nk+2;k+2,k,k,...,k) 
                
               + C(n,2) M(nk+2;k+1,k+1,k,...,k) 

and the probability of achieving k sets in *exactly kn+2 purchases
is 
       p[kn+2]   =    q2 - (q1-q0) - q0    =    q2 - q1

Continuing on in this way, we can generate p[kn+j] for j=0,1,2,...
and then the expected number of purchases is the infinite sum

  E_n(k)  =  kn p[kn]  +  (kn+1) p[kn+1]  +  (kn+2) p[kn+2]  + ...

which can also be written in terms of the q values as

  E_n(k) = kn(q0) + (kn+1)(q1-q0) + (kn+2)(q2-q1) + ...

The partial sum up to the jth term is

             (kn+j) q{j} - (q0+q1+q2+...+q{j-1})

so as j goes to infinity and q{j} goes to 1, this equals

   E_n(k)  =   kn + (1-q0) + (1-q1) + (1-q2) + ...

To illustrate how this works, consider the simple case n=3
and k=2.  We have

       (3^6) q0  =  1M(6;2,2,2)  =    90

       (3^7) q1  =  3M(7;3,2,2)  =   630

       (3^8) q2  =  3M(8;4,2,2) 
                  + 3M(8,3,3,2)  =  2940

       (3^9) q3  =  3M(9;5,2,2) 
                  + 6M(9,4,3,2) 
                  + 1M(9;3,3,3)  = 11508

       (3^10)q4  =  3M(10;6,2,2) 
                  + 6M(10,5,3,2) 
                  + 3M(10;4,4,2) 
                  + 3M(10;4,3,3) = 40950

       (3^11)q5  =  3M(11;7,2,2)
                  + 6M(11;6,3,2)
                  + 6M(11;5,4,2)
                  + 3M(11;5,3,3)
                  + 3M(11;4,4,3) = 137610

so the expected number of purchases is

  E_3(2) =  6 + (1 - 90/3^6) + (1 - 630/3^7) + (1 - 2940/3^8)

                   + (1 - 11508/3^9) + (1 - 40950/3^10)

                          + (1 - 137610/3^11) + ...


         = 6 + 71/81 + 173/243 + 1207/2187 + 2725/6561

             +  2011/6561 + 4393/19683 + ...

This illustrates the two drawbacks of this approach.  First, the
sum doesn't converge very rapidly.  Second, the jth term requires 
us to evaluate all the partitions of j into n parts.  This isn't
too bad for n=3, but for large values of n it would become very
laborious, because the number of partitions rises rapidly.

Now, for n=3 the direct combinatorial approach gives the values

        k      E_3(k)
       ---  ------------
        1       11/2          5.5000000000
        2      347/36         9.6388888888
        3    17479/1296      13.4868827160
        4  3609499/209952    17.1920200807
        5                    20.8084225468
        6                    24.3628949881
        7                    27.8709986160
        8                    31.3427087302
        9                    34.7848707279
       10                    38.2024221968
       11                    41.5990628893
       12                    44.9776497020

This shows that the expected number of purchases goes up by
somewhat more than 3 each time k is incremented, and the first
differences between these numbers evidently go to 3.  It's
also interesting to note that the denominators of E_3(k) are

         (2^1)(3^0)
         (2^2)(3^2)
         (2^4)(3^4)
         (2^5)(3^8)

which seems to suggest a systematic rule of formation.  (Recall
that the denominators for n=2 were all of the form 2^j with
j = 0,1,3,4,7,8,... which showed that each terms was being
multiplied by successive even numbers.)

Return to MathPages Main Menu