Genetic information content of a human being

The human genome contains about 3.2 billion base pairs, according to Wikipedia. Each base is one of 4 possibilities, which is 2 bits, so that's 6.4 billion bits of data, or 800 million bytes. Wikipedia also says that a single genome isn't very compressible: it carries about 1.77 bits of information per base pair (out of a possible 2).
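Here is that arithmetic as a quick sketch, using only the figures quoted above. The variable names are mine, and the "compressed" size just assumes the 1.77 bits per base pair figure applies uniformly across the whole genome.

```python
# Back-of-the-envelope size of one genome, using the figures quoted above.
BASE_PAIRS = 3_200_000_000      # about 3.2 billion base pairs
BITS_PER_BASE = 2               # 4 possible bases -> log2(4) = 2 bits
ENTROPY_BITS_PER_BASE = 1.77    # Wikipedia's information-content figure

raw_bits = BASE_PAIRS * BITS_PER_BASE
raw_bytes = raw_bits // 8

# Assumption: the 1.77 bits/base figure holds uniformly over the genome.
compressed_bytes = BASE_PAIRS * ENTROPY_BITS_PER_BASE / 8

print(f"raw: {raw_bits:,} bits = {raw_bytes:,} bytes")
print(f"at 1.77 bits/base: about {compressed_bytes / 1e6:.0f} MB")
```

So even perfect compression of a single genome only gets you from 800 MB down to roughly 700 MB.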

A child has two copies of each chromosome, one from the egg and one from the sperm. So the information content of a child is determined by how the gametes (the eggs and sperm) are formed. Gametes have one copy of each chromosome; other cells have two. When a gamete is formed, the parent's two copies are mixed, and there are mutations. A crossover is a spot in a gamete's chromosome where one of the parent's two copies stops contributing and the other starts.

There are about 70 mutations per child. Wikipedia gives the crossover rate in humans as about 1 per 100 million base pairs per meiosis, which is about 32 per gamete, or 64 per child. Describing 70 mutations requires an offset (4 bytes per mutation) and the new base (2 bits, oh let's say 1 byte in case it's sometimes more than one base). Describing 64 crossovers requires 64 offsets (4 bytes each). And you need 5 bytes per parent, to remember who the parents were. That's a total of 70*5 + 64*4 + 5*2 = 616 bytes per child, given the parents. That's probably compressible, too. If you wanted to store the name, birthdate, and place of origin as well, say that's another 100 bytes.
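The per-child tally can be checked with a short sketch. The constant names are my own, and the two sanity checks assume a 3.2 billion base pair genome and a population of 10 billion people (the figure used later in this post), to confirm that 4-byte offsets and 5-byte parent IDs are wide enough.

```python
# Bytes needed to describe one child, given both parents' genomes.
MUTATIONS_PER_CHILD = 70
CROSSOVERS_PER_GAMETE = 32
GAMETES_PER_CHILD = 2           # one egg, one sperm
OFFSET_BYTES = 4                # position within the genome
NEW_BASE_BYTES = 1              # generous: 2 bits would do for a single base
PARENT_ID_BYTES = 5
PARENTS = 2

# Sanity checks on the field widths (assumed genome size and population).
GENOME_BASE_PAIRS = 3_200_000_000
POPULATION = 10_000_000_000
assert 2 ** (8 * OFFSET_BYTES) > GENOME_BASE_PAIRS   # 4 bytes can index any base
assert 2 ** (8 * PARENT_ID_BYTES) > POPULATION       # 5 bytes can name any person

mutation_bytes = MUTATIONS_PER_CHILD * (OFFSET_BYTES + NEW_BASE_BYTES)
crossover_bytes = CROSSOVERS_PER_GAMETE * GAMETES_PER_CHILD * OFFSET_BYTES
parent_bytes = PARENTS * PARENT_ID_BYTES

bytes_per_child = mutation_bytes + crossover_bytes + parent_bytes
print(mutation_bytes, crossover_bytes, parent_bytes, bytes_per_child)
```

Note that a 4-byte offset only works if positions are counted across the whole genome, not per chromosome plus a chromosome number; that is the reading I'm assuming here.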

Suppose a standard computer has 8 TB of disk space. Storing the genomes of 10 billion people, at 800 MB apiece, takes 8 exabytes, or a million computers. That's big, but plausible. But if you had full pedigrees, then 10 billion people at 616 bytes each is about 6 TB. It easily fits on a single computer. You also have to store some base set of complete genomes (eventually you run out of parents). I don't know how big the base would have to be. I'm guessing that since most genes have under 20 alleles, it wouldn't be more than 100 complete genomes (80 GB), so it would still easily fit on a single computer.
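A rough sketch of the two storage estimates side by side. The 100-genome base is just my guess from above, and I'm using decimal units (1 TB = 10^12 bytes); the names are made up for illustration.

```python
# Compare storing every genome in full against storing pedigree deltas.
DISK_BYTES = 8 * 10**12             # one "standard computer": 8 TB
PEOPLE = 10 * 10**9                 # 10 billion people
FULL_GENOME_BYTES = 800 * 10**6     # 800 MB per uncompressed genome
PEDIGREE_BYTES_PER_PERSON = 616     # mutations + crossovers + parent IDs
BASE_GENOMES = 100                  # guessed size of the founding set

# Option 1: every genome stored in full.
full_total = PEOPLE * FULL_GENOME_BYTES
computers_needed = -(-full_total // DISK_BYTES)   # ceiling division

# Option 2: a base set of full genomes plus one small record per person.
pedigree_total = PEOPLE * PEDIGREE_BYTES_PER_PERSON
base_total = BASE_GENOMES * FULL_GENOME_BYTES
delta_total = pedigree_total + base_total

print(f"full genomes: {full_total / 1e18:.0f} EB on {computers_needed:,} computers")
print(f"pedigrees:    {delta_total / 1e12:.2f} TB, fits on one disk: {delta_total <= DISK_BYTES}")
```

Adding the extra 100 bytes per person for name, birthdate, and place of origin brings the pedigree figure to about 7.2 TB, which still squeezes under 8 TB.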

Don't believe me? I don't know if I believe it myself. But this should be easy to check. The full genomes of 17 related individuals from 3 generations are publicly available for download. Oops: according to their FAQ, their error rate is 1 in 10^5 base pairs, which means 32,000 errors per genome, which is way more than the 70 actual mutations I want to confirm. Hmm.
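To see how bad the mismatch is, here is the same calculation as a sketch. It reads the FAQ's error rate as 1 in 10^5 base pairs, which is the reading that gives 32,000 errors per genome; the names are mine.

```python
# Expected sequencing errors per genome versus real new mutations.
GENOME_BASE_PAIRS = 3_200_000_000
BASE_PAIRS_PER_ERROR = 10**5        # error rate of 1 in 10^5 base pairs
TRUE_MUTATIONS_PER_CHILD = 70

expected_errors = GENOME_BASE_PAIRS // BASE_PAIRS_PER_ERROR
errors_per_true_mutation = expected_errors / TRUE_MUTATIONS_PER_CHILD

print(f"{expected_errors:,} expected errors per genome")
print(f"about {errors_per_true_mutation:.0f} errors for every real mutation")
```

With hundreds of sequencing errors for every real mutation, a naive parent-to-child diff of that data would be almost entirely noise.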

