Submission + - How Perl Saved the Human Genome Project (dobbscodetalk.com)
From the beginning, researchers realized that informatics would have to play a large role in the genome project. An informatics core formed an integral part of every genome center that was created. The mission of this core was twofold: to provide computer support and database services for their affiliated laboratories, and to develop data analysis and management software for use by the genome community as a whole.
Consider the steps that may be performed on a bit of newly sequenced DNA. First, there's a basic quality check on the sequence: Is it long enough, and are the number of ambiguous letters below the maximum limit? Then, there's the "vector check." For technical reasons, the human DNA must be passed through a bacterium before it can be sequenced (this is the process of "cloning"). Not infrequently, the human DNA gets lost somewhere in the process, and the sequence that's read consists entirely of the bacterial vector. The vector check ensures that only human DNA gets into the database.
Next, there's a check for repetitive sequences. Human DNA is full of repetitive elements that make fitting the sequencing jigsaw puzzle together challenging. The repetitive-sequence check tries to match the new sequence against a library of known repetitive elements. The penultimate step is to attempt to match the new sequence against other sequences in a large community database of DNA sequences. Often, a match at this point will provide a clue to the function of the new DNA sequence. After performing all these checks, the sequence (along with the information that's been gathered about it along the way) is loaded into the local laboratory database.
The process of passing a DNA sequence through these independent, analytic steps looks like a pipeline, and we realized that a UNIX pipe could handle the job. We developed a simple Perl-based data-exchange format called "boulderio" that allowed loosely coupled programs to add information to a pipe-based I/O stream. Boulderio is based on tag/value pairs. A Perl module makes it easy for programs to reach into the input stream, pull out only the tags it is interested in, do something with them, and drop new tags into the output stream. Tags that the program isn't interested in are just passed through to standard output so that other programs in the pipeline can get to them."