Here's the lowdown on how BGZF works, as one example. In this case, there are many short, distinct fragments of DNA being stored together, each with offset and quality information, many of which may be identical. The compression is localized to smaller blocks (at most 64 KB each, rather than the whole file). You're right that there's probably some performance lost due to the misalignment, but 6 and 8 line up every 24 bits, so at worst that means patterns of four codons or three bytes; and since a step of four amino acids lines up nicely with alpha-helix motifs, it's not a total loss.
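For what it's worth, here's a toy sketch of the blocked-compression idea in Python. This is not the actual BGZF framing (real BGZF wraps each block in its own gzip member and records the compressed block size in an extra header field); it just shows why localizing compression to small, independent blocks buys you random access.

    import zlib

    # Toy illustration of block-wise compression (not the real BGZF on-disk
    # format): compress fixed-size chunks independently, so any one block can
    # be inflated later without decompressing everything before it.
    BLOCK_SIZE = 64 * 1024  # BGZF caps blocks at 64 KB; use the same here

    def compress_blocks(data):
        """Return a list of independently compressed blocks."""
        return [zlib.compress(data[i:i + BLOCK_SIZE])
                for i in range(0, len(data), BLOCK_SIZE)]

    def decompress_block(blocks, index):
        """Random access: inflate just the one block you need."""
        return zlib.decompress(blocks[index])

    if __name__ == "__main__":
        reads = b"@read1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n" * 10000
        blocks = compress_blocks(reads)
        print(len(blocks), "blocks,", len(decompress_block(blocks, 0)), "bytes in block 0")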
And, yes, regarding individual genomes: I'm pretty sure that'd be all anyone stored if they didn't have to hold onto the FASTQ files for auditability.
It's a neat thought, but it would never beat the basics. While there are a lot of genes that share common ancestors through duplication (paralogues), the hierarchical history of those genes is often hard to determine, or pre-dates human speciation entirely; for example, there's only one species (a weird blob a little like a multicellular amoeba) that has a single homeobox gene.
While building a complete evolutionary history of gene families is of great interest to science, it's pointless to try to exploit it for compression when we can just turn to standard string methods; as has been mentioned elsewhere on this story, gzip can outrun the read/write throughput of a standard hard drive. Having to replay an evolutionary history we can only guess at would be a royal pain.
That being said, we can store individuals' genomes as something akin to diff patches, which brings 3.1 gigabytes of raw ASCII down to about 4 MB of high-entropy data, even before compression.
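To make that concrete, here's a rough sketch of what such a diff boils down to. The function and record layout are invented for illustration (real pipelines store the differences as VCF), and it only handles single-base substitutions, not insertions or deletions:

    def diff_against_reference(reference, individual):
        """Yield (position, ref_base, alt_base) for every single-base difference.
        Assumes the two sequences are already aligned and equal in length;
        real genomes also have indels, which need a richer record format."""
        for pos, (ref_base, alt_base) in enumerate(zip(reference, individual)):
            if ref_base != alt_base:
                yield (pos, ref_base, alt_base)

    if __name__ == "__main__":
        ref = "ACGTACGTACGT"
        ind = "ACGTACCTACGA"
        print(list(diff_against_reference(ref, ind)))
        # [(6, 'G', 'C'), (11, 'T', 'A')] -- a handful of records instead of
        # the whole sequence is what gets the size down to megabytes.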
Highlights from the bottom of the PayPal Galactic page:
@Stratocumulus: RT @lbillin: #paypalgalactic Incur debt in space! Paypal wants to help http://t.co/cqVsVyCy0B
@JodyYeoh: I visited space and all I got was a probe. #PayPalGalactic
Well, if you really need to have that kind of contest...
The data files being discussed are text files generated as summaries of the raw sensor data from the sequencing machine. In the case of Illumina systems, the raw data consists of a huge high-resolution image; different colours in the image correspond to different nucleotides, and each pixel marks the location of a short fragment of DNA. (Think embarrassingly parallel multithreading.)
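As a very rough illustration of the "colours become bases" step (everything here is invented for the example; a real base caller works on clusters across many imaging cycles and corrects for cross-talk and phasing), you can think of it as picking the brightest of four colour channels at each position for one cycle:

    import numpy as np

    # Assumed channel order A/C/G/T, purely for illustration.
    CHANNEL_TO_BASE = np.array(list("ACGT"))

    def call_bases_one_cycle(image):
        """image: (height, width, 4) dye intensities, one channel per nucleotide.
        Returns a (height, width) grid of base calls for this single cycle."""
        return CHANNEL_TO_BASE[np.argmax(image, axis=-1)]

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        tile = rng.random((4, 4, 4))        # tiny fake image tile
        print(call_bases_one_cycle(tile))   # 4x4 grid of A/C/G/T calls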
If we were to keep and store all of this raw data, the storage requirements would probably be a thousand to a million times what they currently are, to say nothing of the other kinds of biological data that are captured on a regular basis, like raw microarray images.
CNVs (copy-number variants) can actually be detected if you have enough read depth; it's just that most assemblers are too stupid (or, in computer science terms, "algorithmically beautiful") to account for them. SAMTools can generate a coverage/pileup graph without too much hassle, and it should be obvious where significant differences in copy number occur.
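For example, one low-tech way to eyeball copy number (a sketch, not a real CNV caller): pipe the output of samtools depth -a aln.bam into something like the script below and look for windows whose mean coverage sits well above or below the genome-wide average. The window size is an arbitrary choice for illustration; the three-column chrom/pos/depth input is just the usual samtools depth output.

    import sys
    from collections import defaultdict

    WINDOW = 10_000  # bases per window; arbitrary choice for illustration

    def windowed_coverage(lines):
        """Sum per-base depth into fixed-size windows and yield mean coverage.
        Assumes every position is reported (samtools depth -a), so dividing
        by the window size gives the true mean."""
        totals = defaultdict(int)  # (chrom, window index) -> summed depth
        for line in lines:
            chrom, pos, depth = line.split("\t")
            totals[(chrom, int(pos) // WINDOW)] += int(depth)
        for (chrom, win), total in sorted(totals.items()):
            yield chrom, win * WINDOW, total / WINDOW

    if __name__ == "__main__":
        for chrom, start, mean_depth in windowed_coverage(sys.stdin):
            print(f"{chrom}\t{start}\t{mean_depth:.1f}")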
(Also, the human genome is about 3.1 gigabases, so about 3.1 GB in FASTA format at one byte per base. De novo assemblies will tend to be smaller because they can't deal with duplications.)
What is now proved was once only imagin'd. -- William Blake