Let’s say I’ve started collecting cards, like Magic: the Gathering, or baseball cards. Yesterday, I bought a pack of 100 random cards. Today, I bought another pack of 100 random cards. 5 of them were duplicates of cards that I already had from yesterday.
Question: how many distinct cards are there in the game? I.e., if you wanted to collect the whole set of cards published by the game company, how big an album would you need?
My brother asked me this question yesterday. Actually, he’s working in genetics, and wants a catalog of all the types of RNA in an organism. The information he gave me was that he’d drawn two samples with a roughly equivalent number of RNA strands, and while he hadn’t sequenced them all, he knew that 5% of the strands in the second sample were duplicates of strands in the first sample. I don’t know how that’s possible; I assume there’s some chemical technique that can give you that kind of result.
I came up with the collectible-card analogy above as a way to wrap my head around the problem.
I then reasoned as follows: look at the first card in the second pack. What are the odds of it being different from the first card in the first pack? If there are 10,000 distinct cards in the game, then the probability of two randomly-selected cards being different is 9999/10,000.
What, then, is the probability of the first card in the second pack being different from all of the cards in the first pack? It has to be different from the first card, different from the second card, different from the third, and so on:
$p_{\text{diff}} = (9999/10{,}000) \times (9999/10{,}000) \times \cdots \times (9999/10{,}000) = (9999/10{,}000)^{100}$
or more generally, $p_{\text{diff}} = ((N-1)/N)^S = (1 - 1/N)^S$, where $N$ is the number of distinct cards in the game, and $S$ is the sample size, i.e. the number of cards in a pack (including duplicates within a pack).
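For a concrete check of the formula, here is a minimal Python sketch that evaluates it for the example numbers above. The function name `p_diff` is only an illustrative label, and the code assumes every distinct card is equally likely.

```python
def p_diff(n: int, s: int) -> float:
    """Probability that one random card differs from all s cards in a pack,
    assuming n equally likely distinct cards."""
    return (1 - 1 / n) ** s


if __name__ == "__main__":
    # N = 10,000 distinct cards and S = 100 cards per pack, as in the example.
    print(p_diff(10_000, 100))
```

With 10,000 distinct cards and packs of 100, the result is close to 0.99, so a duplicate would be rare.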
From measurement, we know that 5% of the cards in the second pack are duplicates of cards in the first, i.e., duplication happens 5% of the time, which is to say that 95% of them are different. So
$p_{\text{diff}} = (1 - 1/N)^S = 0.95$
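We know $p_{\text{diff}}$ and $S$, so we solve for $N$. Taking the $S$-th root of both sides gives $1 - 1/N = p_{\text{diff}}^{1/S}$, and so
$N = 1/(1 - p_{\text{diff}}^{1/S})$
Plugging in $p_{\text{diff}} = 0.95$ and $S = 100$, we find
$N \approx 1950$.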
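The same calculation can be sketched in a few lines of Python. The function name `estimate_n` is made up for illustration, and the code relies on the same assumption as the derivation: every distinct card is equally likely.

```python
def estimate_n(p_diff: float, s: int) -> float:
    """Solve (1 - 1/N)**s = p_diff for N, assuming equally likely cards."""
    if not 0 < p_diff < 1:
        raise ValueError("p_diff must be strictly between 0 and 1")
    return 1 / (1 - p_diff ** (1 / s))


if __name__ == "__main__":
    # 95% of the second pack differs from the first; packs hold 100 cards.
    print(round(estimate_n(0.95, 100)))
```

Passing a different pack size as `s` shows how the estimate grows with the sample size for the same 5% overlap.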
With $S = 10^6$, we get $N \approx 19.5$ million. Of course, I don’t always trust myself to get this sort of reasoning correct, so I wrote a quick and dirty Perl script that picks a million random numbers, then picks a million more and counts the duplicates. And the numbers came out pretty much as given above, which increases my confidence that I’m right.
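The check described above can be approximated in Python. This is not the original Perl script, just a rough sketch of the same idea; the name `duplicate_fraction`, the fixed seed, and the smaller sizes in the demo are assumptions chosen so that it runs quickly.

```python
import random


def duplicate_fraction(n: int, s: int, seed: int = 0) -> float:
    """Draw two packs of s cards from n equally likely cards and return the
    fraction of the second pack that already appeared in the first."""
    rng = random.Random(seed)
    # First pack: only the set of distinct cards matters for the lookup.
    first_pack = {rng.randrange(n) for _ in range(s)}
    # Second pack: count every card that was already seen in the first pack.
    duplicates = sum(1 for _ in range(s) if rng.randrange(n) in first_pack)
    return duplicates / s


if __name__ == "__main__":
    # With N = 195,000 and S = 10,000 the formula predicts roughly 5% overlap.
    print(duplicate_fraction(195_000, 10_000))
```

The result varies somewhat with the seed, since each run is a single random trial; larger packs give a steadier fraction.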
One big assumption in the above is that there are no common or rare cards, i.e. that every card is equally likely. But I’ll leave unequal probability distributions as an exercise for the reader.
Update: Edited to clarify what “in the game” means.