Comment Re:Wrong problem (Score 2) 239
Well, as others have said, this is kind of correct. After sequencing, the raw reads (short sequences of DNA) are assembled into either transcripts or genome fragments (usually called contigs). This greatly reduces the amount of data, but many scientists are concerned about whether to keep all the raw data for future work. My take is that unless the sample is impossible to collect DNA/RNA from again, toss the raw reads and assume that the sequencing technology will be better/faster/cheaper/longer in the future.
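To give a rough flavor of why assembly shrinks the data, here is a toy greedy overlap-merge in Python. This is only a sketch, not what real assemblers actually do, and the function names and sample reads are made up for illustration:

# Toy sketch: greedily merge short reads into contigs by suffix/prefix overlap.

def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that matches a prefix of b."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def greedy_assemble(reads: list[str]) -> list[str]:
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b)
                    if olen > best_len:
                        best_len, best_i, best_j = olen, i, j
        if best_len == 0:
            return reads  # whatever is left are the contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)] + [merged]

reads = ["ATTAGACCTG", "AGACCTGCCG", "CCTGCCGGAA", "GCCGGAATAC"]
print(greedy_assemble(reads))  # four overlapping reads collapse into one contig

Four reads collapse into one contig, and at real scale (hundreds of millions of reads with heavy redundancy) the reduction is enormous, which is exactly why people debate keeping the raw reads at all.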
I'm actually involved with a large US National Science Foundation project to help build the cyberinfrastructure for handling these data and analyses: the iPlant Collaborative, http://iplantcollaborative.org/. In addition, I maintain a set of web-based software for comparative genomics: CoGe, http://genomevolution.org/. For genomes, I adopted the philosophy of building a system that can easily accommodate both new versions of existing genomes and entirely new genomes. As new data become available, they are quickly loaded into the system and made available for analysis by any of the existing tools, or for comparison against any of the already loaded genomes. So far the system has scaled quite well, and it currently stores over 16,000 genomes from over 12,500 organisms.

While the science is a lot of fun (sort of like the ultimate video game, except no one knows the rules and there are no pre-built user interfaces), it is awesome to see how quickly the number of sequenced genomes has grown. This is driven by how cheap the technology has become and by the quantity of data it can produce. For those interested, the National Human Genome Research Institute tracks this and has some very informative graphs: http://www.genome.gov/SequencingCosts/.
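The versioning idea boils down to something like the toy Python sketch below. The class and field names are hypothetical and have nothing to do with CoGe's actual schema; the point is just that each genome is keyed by (organism, version), so a new assembly sits alongside the old one instead of replacing it, and every loaded genome is immediately visible to the same tools.

# Hypothetical sketch of a versioned genome store (not CoGe's real data model).
from dataclasses import dataclass, field

@dataclass
class Genome:
    organism: str
    version: str
    contigs: dict[str, str]  # contig name -> sequence

@dataclass
class GenomeRegistry:
    _genomes: dict[tuple[str, str], Genome] = field(default_factory=dict)

    def load(self, genome: Genome) -> None:
        # New versions are added alongside existing ones, never overwritten.
        key = (genome.organism, genome.version)
        if key in self._genomes:
            raise ValueError(f"{key} is already loaded")
        self._genomes[key] = genome

    def versions(self, organism: str) -> list[str]:
        return [v for (org, v) in self._genomes if org == organism]

registry = GenomeRegistry()
registry.load(Genome("Arabidopsis thaliana", "TAIR9", {"Chr1": "ATGC..."}))
registry.load(Genome("Arabidopsis thaliana", "TAIR10", {"Chr1": "ATGC..."}))
print(registry.versions("Arabidopsis thaliana"))  # ['TAIR9', 'TAIR10']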
As others have also said, the analysis and interpretation of these data are extremely rate limiting. There is lots of opportunity for folks with programming, algorithm, data visualization, web, and user interface experience.