Here's the lowdown on how BGZF works, as one example. In this case, many short distinct fragments of DNA are being stored together, each with offset and quality information, and many of them may be identical. The compression is localized to smaller blocks (I'm not sure if they're 4096-byte disk sectors or something else). You're right that there's probably some performance lost to the misalignment, but 6 and 8 line up every 24 bits, so at worst that means patterns of four codons or three bytes; and a step of four amino acids is ideal for alpha-helix motifs, so it's not a total loss.
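The realignment arithmetic above is just a least-common-multiple calculation; a quick sketch (the constant names are mine, not from any spec):

```python
from math import lcm

# A codon is 3 nucleotides; at 2 bits per base that's 6 bits per codon,
# packed into a stream of ordinary 8-bit bytes.
BITS_PER_CODON = 6
BITS_PER_BYTE = 8

# The packed stream realigns wherever both units complete a whole number
# of repetitions, i.e. at the least common multiple of the two widths.
realign = lcm(BITS_PER_CODON, BITS_PER_BYTE)   # 24 bits
codons_per_cycle = realign // BITS_PER_CODON   # 4 codons
bytes_per_cycle = realign // BITS_PER_BYTE     # 3 bytes

print(realign, codons_per_cycle, bytes_per_cycle)  # 24 4 3
```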
And, yes, regarding individual genomes: I'm pretty sure that'd be all anyone stored if they didn't have to hold onto the FASTQ files for auditability.
It's a neat thought, but it would never beat the basics. While there are a lot of genes that share common ancestors (called paralogues), the hierarchical history of these genes is often hard to determine, or pre-dates human speciation entirely; for example, there's only one species (a weird blob a little like a multicellular amoeba) that has a single homeobox gene.
While building a complete evolutionary history of gene families is of great interest to science, it's pointless to try exploiting it for compression when we can just turn to standard string methods; as has been mentioned elsewhere on this story, gzip can be faster than the read/write buffer on standard hard drives. Having to replay an evolutionary history we can only guess at would be a royal pain.
That being said, we can store individuals' genomes as something akin to diff patches, which brings 3.1 gigabytes of raw ASCII down to about 4 MB of high-entropy data, even before compression.
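The diff-patch idea can be sketched in a few lines. This is a toy illustration, not how real pipelines do it (they use VCF files and handle insertions and deletions, which this ignores); the function names are mine:

```python
# Hypothetical sketch: represent an individual's genome as a patch of
# single-base differences against a shared reference sequence.

def make_patch(reference, individual):
    """List of (position, reference_base, individual_base) differences."""
    return [(i, r, s) for i, (r, s) in enumerate(zip(reference, individual))
            if r != s]

def apply_patch(reference, patch):
    """Reconstruct the individual's sequence from reference + patch."""
    seq = list(reference)
    for pos, ref_base, alt_base in patch:
        assert seq[pos] == ref_base, "patch does not match this reference"
        seq[pos] = alt_base
    return "".join(seq)

ref = "ACGTACGTAC"
ind = "ACGTTCGTAA"
patch = make_patch(ref, ind)
print(patch)                           # [(4, 'A', 'T'), (9, 'C', 'A')]
print(apply_patch(ref, patch) == ind)  # True
```

Since any two people differ at only a few million positions out of ~3.1 billion, the patch is tiny relative to the full sequence, which is where the gigabytes-to-megabytes reduction comes from.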
Highlights from the bottom of the PayPal Galactic page:
@Stratocumulus: RT @lbillin: #paypalgalactic Incur debt in space! Paypal wants to help http://t.co/cqVsVyCy0B
@JodyYeoh: I visited space and all I got was a probe. #PayPalGalactic
Well, if you really need to have that kind of contest...
The data files being discussed are text files generated as summaries of the raw sensor data from the sequencing machine. In the case of Illumina systems, the raw data consists of a huge high-resolution image; different colours in the image are interpreted as different nucleotides, and each bright spot is interpreted as the location of a short fragment of DNA. (Think embarrassingly parallel multithreading.)
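The summary text format those images get boiled down to is FASTQ: four lines per read, with per-base quality scores encoded as printable ASCII. A minimal parser, assuming the usual Phred+33 convention:

```python
# Minimal sketch of the four-line FASTQ record format: header line,
# sequence, '+' separator, and per-base quality scores encoded as
# ASCII characters (Phred+33, the usual Illumina convention today).

def parse_fastq(lines):
    """Yield (read_id, sequence, quality_scores) from FASTQ text lines."""
    it = iter(lines)
    for header in it:
        seq = next(it)
        next(it)   # '+' separator line, ignored here
        qual = next(it)
        # Phred+33: ASCII code minus 33 gives the quality score.
        scores = [ord(c) - 33 for c in qual]
        yield header[1:], seq, scores

record = ["@read1", "ACGT", "+", "IIII"]
rid, seq, scores = next(parse_fastq(record))
print(rid, seq, scores)  # read1 ACGT [40, 40, 40, 40]
```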
If we were to keep and store all of this raw data, the storage requirements would probably be a thousand to a million times what they currently are—to say nothing of the other kinds of biological data that's captured on a regular basis, like raw microarray images.
CNVs actually can be detected if you have enough read depth; it's just that most assemblers are too stupid (or, in computer science terms, "algorithmically beautiful") to account for them. SAMtools can generate a coverage/pileup graph without too much hassle, and it should be obvious where significant differences in copy number occur.
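The idea behind spotting copy-number differences in a coverage track is simple: flag stretches whose read depth sits far from the genome-wide baseline. A toy sketch (the threshold, the function name, and the depth array are all made up for illustration; in practice you'd feed in real per-position depths from a pileup):

```python
from statistics import median

def flag_cnv_regions(depths, factor=1.5):
    """Return (start, end) index ranges where depth deviates from the
    median by at least `factor` in either direction (duplication or loss)."""
    base = median(depths)
    regions, start = [], None
    for i, d in enumerate(depths):
        outlier = d >= base * factor or d <= base / factor
        if outlier and start is None:
            start = i
        elif not outlier and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(depths)))
    return regions

# Baseline depth ~30; a doubled region and a half-depth region stand out.
depths = [30, 31, 29, 60, 62, 61, 30, 29, 10, 11, 30]
print(flag_cnv_regions(depths))  # [(3, 6), (8, 10)]
```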
(Also, the human genome is about 3.1 gigabases, so about 3.1 GB in FASTA format at one byte per base. De novo assemblies will tend to be smaller because they can't deal with duplications.)
I can't comment on the physics data, but in the case of the bio data that the article discusses, we honestly have no idea what to do with it. Most sequencing projects collect an enormous amount of useless information, a little like saving an image of your hard drive every time you screw up GRUB's menu.lst. We keep it around on the off chance that some of it might be useful in some other way eventually, although there are ongoing concerns that much of the data just won't be high enough quality for some applications.
On the other hand, a lot of the specialised datasets (like the ones being stored in the article) are meant as baselines, so researchers studying specific problems or populations don't have to go out and get their own information. Researchers working with such data usually have access to various clusters or supercomputers through their institutions; for example, my university gives me access to SciNet. There's still competition for access when someone wants to run a really big job, but there are practical alternatives in many cases (such as GPGPU computing).
Also, I'm pretty sure the Utah data centre is kept pretty busy with its NSA business.
This happens occasionally in animal breeding. Blue eyes in a white-furred cat are a strong indicator of deafness.
That being said, the definition of "proper" biochemical function is relative, so you can't really say that a developmental gene that produces healthy results is malfunctioning. A lot of subtle differences between people are caused by changes in how long or how tightly two proteins interact. You could call the European light-skin phenotype evidence of a defective gene, because it's defined by a shortage of melanosomes, which protect the body from UV light. (On the other hand, it improves vitamin D production, which requires UV light.)
There are even plenty of cases in the human body where healthy behaviour depends on what should be, by all rights, improper gene function: the cervical plug is made up largely of malformed virus particles (just the shells) which our ancestors commandeered millions of years ago. Without this strange adaptation, most pregnancies would fail. The attached placenta also owes its heritage to viral genes; without it, newborn human babies wouldn't be much larger than newborn rats.