Beginning Perl for Bioinformatics

Posted by timothy on Tue Jan 29, 2002 10:00 AM
from the listen-up-class dept.
babbage writes: "As the banner above the title of James Tisdall's Beginning Perl for Bioinformatics indicates, this book is 'an introduction to Perl for biologists.' What the banner doesn't mention is that it's also an introduction to biology and bioinformatics for Perl programmers, and it's also an introduction to both Perl *and* biology for people who have never really been exposed to either field. The author has clearly thought a lot about making one book to please these different audiences, and he has pulled it off nicely, in a way that manages to explain basic topics to people learning about each field for the first time while not coming off as condescending or slow-paced to those who might already have some exposure to it." Read on for the rest of his review.
Beginning Perl for Bioinformatics
author James Tisdall
pages 400
publisher O'Reilly & Associates
rating 8
reviewer babbage
ISBN 0-596-00080-4
summary Well-balanced approach to applying Perl's sorting and analytical abilities to the field of bioinformatics.

Superficially, this book isn't all that different from a lot of introductory Perl books: the Perl material starts out with an overview of the language, followed by a crash course on installing Perl, writing programs, and running them. From there, it goes on to introduce all the various language constructs, from variables to statements to subroutines, that any programmer is going to have to get comfortable with. Pretty run of the mill so far. Tisdall starts with two interesting assumptions, though: [1] that the reader may have never written a computer program before, and so needs to learn how to engineer a robust application that will do its job efficiently and well, and [2] that the reader wants to know how to write programs that can solve a series of biological problems, specifically in genetics and proteomics.

As such, there is at least as much material about the problems that a biologist faces and the places she can go to get the data she needs as there is about the issues that a Perl programmer needs to be aware of. The author introduces the reader to the basics of DNA chemistry, the cellular processes that convert DNA to RNA and then proteins, and a little bit about how and why this is important to the biologist and what sorts of information would help a biologist's research. The main sources of public genetic data are noted, and the often confusing -- and huge -- datafiles that can be obtained from these sources are examined in detail.
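To make the DNA-to-RNA-to-protein step concrete, here is a minimal sketch. It is written in Python rather than the book's Perl, and it is not Tisdall's code; the function names are made up for illustration. It assumes the input is the coding strand, uses the standard genetic code, and stops at the first stop codon.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate coding-strand DNA into one-letter amino acids.

    Reading starts at position 0 and stops at the first stop codon.
    Codons containing anything other than A, C, G, T become 'X'.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("ATGGCCTAA")
peptide = translate("ATGGCCTAA")
```

Real sequences need more care (reading frames, the reverse strand, alternative genetic codes), which this sketch ignores.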

With the code he presents for solving these problems, Tisdall makes a point of not falling into the indecipherable-Perl trap: this is a useful language, well suited to the essentially text-analysis problems that bioinformatics poses, and he doesn't want to encourage the kind of dense, obscure, idiomatic coding style that has given Perl an undeservedly bad reputation. Some of Perl's more esoteric constructs are useful, and they show up when they're needed, but they're left out when they would only serve to confuse the reader. This is a good decision.

Rather, the focus is on teaching readers how to solve biological problems with a carefully developed library of code that happens to leverage some of Perl's most useful properties. The result is pretty much a biologist's edition of Christiansen & Torkington's Perl Cookbook or Dave Cross' Data Munging With Perl. The author presents a series of issues that a working bioinformaticist might have to deal with daily -- parsing over BLAST, GenBank, and PDB files, finding relevant motifs in that parsed data, and preparing reports about all of it. If a bioinformaticist's job is to be able to report on interesting patterns from these various sources, then following the programming techniques that Tisdall explains in clear, easy-to-follow prose would be an excellent way to go about doing it.
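As a rough idea of what "finding relevant motifs" can look like once a sequence has been parsed out of a file, here is a small sketch in Python (not the book's Perl, and not its library). The helper name `find_motif` and the sample sequence are invented for illustration; the motif used is GAATTC, the EcoRI recognition site.

```python
import re


def find_motif(sequence, pattern):
    """Return (start_index, matched_text) for each non-overlapping match.

    `pattern` is an ordinary regular expression, so character classes
    such as "GA[AT]TC" can stand in for degenerate positions.
    """
    return [(m.start(), m.group()) for m in re.finditer(pattern, sequence.upper())]


dna = "ccgatgaattcgtagaattctt"
hits = find_motif(dna, "GAATTC")
for start, text in hits:
    print(f"{text} at position {start}")
```

Positions are zero-based here; overlapping occurrences would need a lookahead pattern instead.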

And when I say "programming techniques," note that I'm not specifically mentioning Perl. The code in this book is clear and organized, and all programs are carefully decomposed into logical subroutines that are then packaged up into a library file that each later sample program gets to draw from. Each new program typically contains a main section of a dozen lines of code or less, followed by no more than two or three new subroutines, along with calls to routines written earlier and called from the BeginPerlBioinfo.pm that is built up as the book progresses. Each sample is typically preceded by a description of what it's trying to accomplish and followed by a detailed description of how it was done, as well as suggestions of other ways that might have worked or not worked.

This modular approach is fantastic -- too many Perl books seem to focus so heavily on the mechanics of getting short scripts to work that they lose sight of how to build up a suite of useful methods and, from those methods, to develop ever-more-sophisticated applications. It isn't quite object-oriented programming, but that's clearly where Tisdall is headed with these samples, and given a few more chapters he probably would have started formally wrapping some of this code into OO packages.

If I have a complaint with the book, in fact, it's that Tisdall doesn't go any further: everything is good, but it ends too soon. Seemingly important topics such as OO programming, XML, graphics (charts & GUIs), CGI, and DBI are mentioned only in passing, under "further topics" in the last chapter. I also have a feeling that some of the biology was shorted, and the book barely touches upon the statistical analysis that probably is a critical aspect of the advanced bioinformaticist's toolbox. I can understand wanting to keep the length of a beginner's book relatively short, and this was probably the right decision, but it would have been nice to see some of the earlier sample problems revisited in these new contexts by, for example, formally making an OO library, showing a sample program that provided a web interface to some of the methods already written, or presenting code that output results as XML or exchanged them with a database.

But these are minor quibbles, and if the reader is comfortable with the material up to this point, she shouldn't have a hard time figuring out how to go a step further and do these things alone. It's a solid book, and one that should be able to get people learning Perl, genetics, or both up to speed and working on real world problems quickly.


  • if only it was in italian... (Score:4, Funny)

    by joss (1346) on Tuesday January 29 2002, @10:05AM (#2919329) Homepage
    then I could learn perl, biology, and Italian all at the same time.
  • Heh (Score:3, Funny)

    by British (51765) <british1500 @ g m a i l.com> on Tuesday January 29 2002, @10:09AM (#2919352) Homepage Journal
    "You got your Perl in my biology!"

    "You got your biology in my perl!"

    Two great interests that interest great together!
  • Awesome. (Score:1)

    by sawilson (317999) on Tuesday January 29 2002, @10:10AM (#2919353) Homepage
    I like it when I see a "tie in" to another industry or scientific discipline. I could read this book, learn all about DNA, crack it with a perl script, then get served papers by $DEITY so I can be prosecuted under the DMCA.
    • Re:Awesome. by TooTallFourThinking (Score:1) Tuesday January 29 2002, @12:00PM
  • by TheCow (191714) on Tuesday January 29 2002, @10:11AM (#2919355) Homepage
    Now I can convert the code for my Terminator robot from Fortran 77 to Perl! Good bye columns!
  • statistical approaches (Score:5, Insightful)

    by ciole (211179) on Tuesday January 29 2002, @10:12AM (#2919357)
    I felt the same about the lack of statistical approaches. While this book is probably great for biologists just learning to write code, for coders entering the field (bioinformatics) it contains too little biology or math to be really educational. My opinion.

    What I'd love would be a dissection of the construction of various motif analysis tools, critiquing various impl's of HMMs, really going into detail. This seems like a perfect complementary work to OSS, so I might even find one, someday...
    • Re:statistical approaches by sql*kitten (Score:2) Tuesday January 29 2002, @10:25AM
      • Re:statistical approaches by ciole (Score:1) Tuesday January 29 2002, @10:44AM
      • Re:statistical approaches by babbage (Score:3) Tuesday January 29 2002, @10:50AM
      • Re:statistical approaches (Score:4, Informative)

        by Marcus Brody (320463) on Tuesday January 29 2002, @10:59AM (#2919563) Homepage
        why would you want to use Perl over a flat file data set

        Good Question. Answer is yes and no.
        Flat Files are really quite useful in biology (btw, when a biologist mentions a "database", he almost certainly means a "flatfile"). DNA/RNA/Proteins are just a long sequence of letters, and therefore these are perfectly represented by good ol' ASCII. This is particularly useful for means of distribution etc. When annotations are added to the data, they are traditionally added to the flatfile by way of an "annotation table", to keep the simple ease of ASCII.

        However, more advanced ways are used to store annotations of biological data, although traditional databases aren't always that good at expressing the rather messy randomness of biology ;-) Therefore, specialised databases such as acedb [acedb.org] are quite useful and intuitive to the biological mind. Furthermore, projects such as ensembl [ensembl.org] (which ambitiously attempts annotations on the whole genome) store their data in an SQL database. However, they still make extensive use of perl to interact with the database.
      • Re:statistical approaches by JimMcCusker (Score:1) Tuesday January 29 2002, @04:24PM
      • Re:statistical approaches by Untimely Ripp'd (Score:1) Wednesday January 30 2002, @04:32PM
    • Re:statistical approaches by dAzED1 (Score:1) Tuesday January 29 2002, @12:45PM
    • Re:statistical approaches by T. Will S. Idea (Score:1) Tuesday January 29 2002, @01:44PM
  • I haven't read it myself but (Score:5, Insightful)

    by Theodore Logan (139352) on Tuesday January 29 2002, @10:15AM (#2919367)
    I have a number of friends in the business who have read that book. In summary:

    1) It is good for biologists who want to learn how to program

    2) It is not good for programmers who want to learn biology

    Obviously, my friends disagree with reviewer Babbage on this point. However, a quick look on Amazon [amazon.com] reveals that most reviewers who found the book interesting are biologists with no programming experience instead of the other way round.

  • Flashbacks (Score:4, Interesting)

    by keiferb (267153) on Tuesday January 29 2002, @10:17AM (#2919377) Homepage
    Seeing a title like this, aiming a particular language at a particular discipline makes me flash back to the college days (last year) where the engineering classes all used fortran. God forbid, if perl gets outdated in another few years, are all the Biologists in the world going to lock themselves into a dead language like those stuffy engineers?
    • Re:Flashbacks (Score:4, Informative)

      by glwtta (532858) on Tuesday January 29 2002, @11:26AM (#2919728) Homepage

      I've worked in bioinformatics for the last few years, and I can say that there's a bit of a difference between bioinf and perl, and engineering and fortran - perl is suited for bioinformatics far, FAR better than any other language. And so far the benefits of modern languages just can't seem to outweigh this innate suitability.

      Traditionally almost all bioinformatics tools have been done in perl, and they continue to be so, for one very simple reason - bioinformatics, when it comes down to it, is just plain text processing.

      Anyway, about the book itself - it's nice for biologists who want to learn something about programming, but I neither learned much about biology from it, nor am I afraid I will lose my job because all the bio people are gonna start doing their own programming :)

      • Re:Flashbacks by T. Will S. Idea (Score:1) Tuesday January 29 2002, @01:08PM
        • Re:Flashbacks by glwtta (Score:2) Tuesday January 29 2002, @04:34PM
    • Fortran by wiredog (Score:2) Tuesday January 29 2002, @11:58AM
    • Re:Flashbacks by squidfood (Score:1) Tuesday January 29 2002, @06:13PM
    • Re:Flashbacks by lisam (Score:1) Tuesday January 29 2002, @01:08PM
  • The challenge of Bioinformatics (Score:5, Informative)

    by nesneros (214571) on Tuesday January 29 2002, @10:19AM (#2919385) Homepage
    Bioinformatics is probably the biggest challenge facing the biological sciences in the next few years. It's becoming more and more apparent that even slight changes in very small elements of a system (e.g., a small sequence of a protein, the behavior of a single neuron within a group of 10,000) can have a drastic effect on the behavior of the entire system. As a result, to really study the problem, you have to acquire massive amounts of data. For example, in our lab we routinely collect data from 64 channels of 16-bit data (monitoring neuron firing in culture) at 1 kHz; in addition, we're simultaneously taking calcium imaging video at 100 fps at 256x256 (at 256 colors). This results in about 200 MB of data gathered every second. Considering we run tests for over 10 minutes, just acquiring and storing this data is a challenge, but finding useful methods to analyze it is even more difficult. It's refreshing to see texts being written on how to bridge the gap between comp. sci. and biology. I've been working in the area for about 4 years now, and it's really great to see the field growing and getting more mainstream attention.
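For a feel of the arithmetic behind figures like these, here is a back-of-the-envelope sketch in Python. It counts only the two raw streams as described in the comment and rests on assumptions the comment does not state: uncompressed storage, one byte per pixel for 256 colors, and no file-format or metadata overhead. Anything else the lab records is not included, so this is a lower bound on the raw rate rather than a reconstruction of the quoted total.

```python
# Electrode recording: 64 channels, 16-bit samples, 1 kHz sampling.
channels = 64
bytes_per_sample = 2          # 16 bits
sample_rate_hz = 1_000
electrode_bytes_per_s = channels * bytes_per_sample * sample_rate_hz

# Calcium imaging video: 256x256 pixels, 100 frames per second.
width = height = 256
bytes_per_pixel = 1           # assumption: 256 colors stored as 8 bits
frames_per_s = 100
video_bytes_per_s = width * height * bytes_per_pixel * frames_per_s

total_bytes_per_s = electrode_bytes_per_s + video_bytes_per_s
ten_minute_run_bytes = total_bytes_per_s * 10 * 60

print(f"electrodes: {electrode_bytes_per_s / 1e6:.3f} MB/s")
print(f"video:      {video_bytes_per_s / 1e6:.3f} MB/s")
print(f"10 minutes: {ten_minute_run_bytes / 1e9:.2f} GB")
```

Under these assumptions the video stream dominates, at roughly 6.6 MB/s against 0.128 MB/s for the electrodes.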
  • try it (Score:2)

    by oo7tushar (311912) <slash.@tushar.cx> on Tuesday January 29 2002, @10:19AM (#2919390) Homepage
    As a CS person about to switch into Biology I found the reviewed book interesting. Even if you have a good handle on Perl and Biology you will find certain elements in the book intriguing.
    On a personal experience side note, Perl does seem to handle genetics problems with quite a bit of ease. The ease seems to stem from Perl's obfuscation. (it also seems to confuse my Biology profs quite a bit since my answers are legitimate answers on the exams)
  • And (Score:2)

    by NMerriam (15122) <NMerriam@artboy.org> on Tuesday January 29 2002, @10:20AM (#2919399) Homepage
    They also don't mention it's a great introduction to books for those familiar with perl, biology, and bioinformatics, but not the written word!...
  • As a biologist... (Score:3, Interesting)

    by Ubi_NL (313657) <joris@ i d eeel.nl> on Tuesday January 29 2002, @10:22AM (#2919404) Journal
    We were just discussing programming languages recently.
    We use so-called micro-arrays frequently, which yield so much information it is not possible to go through all that manually (on average you get about 10,000 "genes" that show changes in expression, after which you have to check the interesting ones for functionality).
    At the moment we can either mess around with MS excel or buy some serious software which is so incredibly expensive only companies can afford it.
    Still I doubt whether Perl should be the language of choice due to it tending to be "write-only code". Maybe this book will change my mind though.
  • by Anonymous Coward on Tuesday January 29 2002, @10:22AM (#2919407)
    This could spawn a great trend in cross-area programming books. Ada for Historians? Smalltalk for Hairdressers?
  • More for your library (Score:5, Informative)

    by chundercanada (520279) on Tuesday January 29 2002, @10:31AM (#2919443)
    I just spent a couple of days trying to choose a few books in this area. My interest was as a computer guy needing to get filled in on the bio side of things. Here are the books I ended up ordering:

    Human Molecular Genetics 2 [amazon.com]: Looks to be a great primer on all the biology background.

    Bioinformatics: A Practical Guide... [amazon.com]: This book is a detailed tour of the online databases and existing tools for analysis of genes and proteins.

    Algorithms on Strings, Trees and Sequences [amazon.com]: This is a book for real computer science types who want to do high-performance implementations of new tools.

  • Universities going this way (Score:5, Interesting)

    by Marx_Mrvelous (532372) on Tuesday January 29 2002, @10:40AM (#2919484) Homepage
    At Purdue University, there is a class specifically meant for CS majors and Biology majors, to address this same issue. I wonder if they use this book in the class.
  • by SloppyElvis (450156) on Tuesday January 29 2002, @10:42AM (#2919493)
    The BioPerl project [perl.org] (http://bio.perl.org/) has been going on for some time.

    In their own words: "The Bioperl Project is an international association of developers of open source Perl tools for bioinformatics, genomics and life science research."

    There bioinformaticians can find a wealth of useful Perl scripts and modules to use in their efforts.

    Yet another example of an open source initiative serving the needs of science!
  • by RevAaron (125240) <revaaron@@@hotmail...com> on Tuesday January 29 2002, @10:45AM (#2919506) Homepage
    This book seems to equate biology with genomics/bioinformatics, when that is simply not the case. There are a fair number of scientists in the general school of biology who *are not* bioinformaticians. As a person who does computational ecology, this book really wouldn't help me, and I am a biologist. Sure, DNA is swell, but it won't tell us about the complex interactions between a number of populations of organisms and the environment in which they live; it doesn't provide strategies and formulas (or references to perl modules?) that *other* kinds of biologists use. ...sigh.
  • by thelen (208445) on Tuesday January 29 2002, @10:49AM (#2919520) Homepage

    What's the aim of this book, really? Is it meant to give the layperson in either field a hobby in the other? Are you supposed to read this and then go get a job in bioinformatics? As a Perl programmer with an interest in Biology but no formal training in it, I can say with certainty that it's not the latter. To land a job in that field you basically must have a graduate degree in one of the two fields, preferably with significant formal education in the other as well.

    I might pick up this book because it sounds genuinely worthwhile, but I fully expect that at the end of it I'd feel more than anything that I needed to go back to school.

  • by tlh1005 (541240) on Tuesday January 29 2002, @10:53AM (#2919540)
    Odd for me that this story was on slashdot today. I've spent the last 24 hrs lurking around the net trying to find books that'll give me a little info on bioinformatics. Anyways, I have a CS degree and I am kicking around the idea of taking Biology classes. I know a tiny bit about Biology but not any significant amount at all. I was wondering if you guys could recommend some books for a programmer in terms of bioinformatics?? I've seen the recommendations on bioinformatics.org but I want some feedback from some of you knowledgeable slashdotters. Feel free to send email.....
  • Human DNA simulator in perl (Score:1, Funny)

    by Anonymous Coward on Tuesday January 29 2002, @10:54AM (#2919544)
    perl -e 'for (1..1000000) { print ${[G,T,C,A]}[int(rand() * 4)] }'

    -- This is my penis. There are many like it, but this one is mine.
  • by Mentifex (187202) on Tuesday January 29 2002, @10:54AM (#2919545) Homepage Journal

    Anyone who wanders into the use of Perl for bioinformatics ought to consider the ultimate plunge into the use of Perl for neuroscientific Artificial Intelligence. [develooper.com] Since v.t.y. Mentifex here has been coding the AI Brain-Mind in JavaScript [scn.org] for tutorial purposes and also in Forth for Intelligent Mind Roboinformatics, [scn.org] the switch-over to Perl is advancing so slowly that I must first promulgate some candidate AI module proposals for inclusion among the object-oriented Perl 5 Module List. [cpan.org]

    The Comprehensive Perl Archive Network (CPAN) [cpan.org] contains some not-yet-implemented, suggested AI module namespaces for those who read the Beginning Perl book reviewed here on SlashDot and who may then wish to do some really exciting, wave-of-the-future Perl neuroscience theory and practice work. [scn.org]

  • by pongo000 (97357) on Tuesday January 29 2002, @11:03AM (#2919587)

    If I have a complaint with the book, in fact, it's that Tisdall doesn't go any further: everything is good, but it ends too soon. Seemingly important topics such as OO programming...are mentioned only in passing, under "further topics" in the last chapter.

    Maybe that's because Perl's OO support is an extremely kludged-together ugly beast that's undergoing a much-needed facelift in Perl6.

    The author actually does the world a favor by not mentioning Perl and OO in the same sentence.
  • by pclminion (145572) on Tuesday January 29 2002, @11:31AM (#2919757)
    Why do scientists gravitate to these scripting languages? My guess is that scripting languages avoid several common things that non-programmers usually have a hard time with:
    • Variable declarations
    • Memory allocation
    • Type conversion
    Unless you're using Python in which case you have to do type conversion sometimes...

    Really, why scripting languages? It seems like some of these scientists are getting really good at it, using OO and everything. Why not switch over to a native language like C++ (which isn't actually that hideous if you avoid all the stupid features) and do the calculations 50 times faster?

    Anyone have input?

  • by tony clifton (134762) on Tuesday January 29 2002, @11:38AM (#2919795)
    In the San Francisco area, the Biotech companies are on a hiring swing. It's a notoriously hard area for even the strongest programmers to get a job in, unless they've worked in biotech before.

    Any indications if this book (or any of the others noted here) would be enough to get someone in the door?
  • There's gotta be some legit way to link the two. I aim to be more than just a consumer of both ;) It's time to give a little something back to both communities I feel, it's only polite...
  • by ubiquitin (28396) on Tuesday January 29 2002, @12:15PM (#2920025) Homepage Journal
    If you work on or with proteins (structural biology, biophysics, etc.) you will find this book to be largely a waste of time. An earlier slashdotter said it: there is more to biology than genomics. O'Reilly should stick to unix, leave the science for the peer-reviewed journals. Amen.

    P.S. If you want an intro to some field in biology, read up on TIBS (Trends in Biochemical Sciences for the uninitiated.)
  • Perl and Bioinformatics (Score:5, Informative)

    by fasta (301231) on Tuesday January 29 2002, @12:21PM (#2920065)
    I would like to answer several questions that were raised in this discussion.

    (1) How does a CS person learn biology? I recommend "Recombinant DNA: A Short Course", as an accessible (Scientific American style) introduction to the cloning breakthroughs and discoveries that led to genome science.

    (2) How does a CS person learn "Bioinformatics"? I strongly recommend "Bioinformatics - Sequence and Genome Analysis" by David Mount as an accessible and extremely comprehensive survey of current approaches in Biological Sequence Analysis.

    (3) Why do Biologists use Perl? Much of the information Biologists want is on the WWW, and Perl's LWP makes it extremely easy to get it. We don't use Perl for sophisticated text analysis (similarity searching, motif searching, etc) because the algorithms that are appropriate are typically not exact (or even regular expression) matches. But it's difficult to beat Perl for getting stuff off the WWW.

    (4) Why do Biologists use Flat files? Several reasons - (a) the most useful information is sequence information, and it can be read much more quickly out of a flatfile (esp. one that is memory mapped) than a DB; (b) flat files solve some versioning problems that DB's make very complex and slow; (c) most data providers only provide flatfiles. This will change, however, over the next 2 - 3 years; MySQL and PostgreSQL are moving into biology labs.
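As an illustration of point (4a), here is a minimal sketch of pulling one sequence out of a memory-mapped flat file. It is in Python and makes assumptions the comment does not: the file is FASTA-style (a `>` header line followed by sequence lines), the record ID is the first word after `>`, and the helper name `fetch_record` is invented. A real reader would use an index rather than a linear scan.

```python
import mmap
import os
import tempfile


def fetch_record(path, seq_id):
    """Return the sequence for `seq_id` from a FASTA-style flat file, or None."""
    header = b">" + seq_id.encode("ascii")
    with open(path, "rb") as fh, \
            mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = mm.find(header)
        while pos != -1:
            at_line_start = pos == 0 or mm[pos - 1:pos] == b"\n"
            after_id = mm[pos + len(header):pos + len(header) + 1]
            # Accept only a whole-ID match at the start of a line.
            if at_line_start and after_id in (b"", b" ", b"\t", b"\r", b"\n"):
                break
            pos = mm.find(header, pos + 1)
        if pos == -1:
            return None
        header_end = mm.find(b"\n", pos)
        if header_end == -1:          # header line with no sequence after it
            return ""
        body_start = header_end + 1
        body_end = mm.find(b"\n>", header_end)   # start of the next record
        if body_end == -1:
            body_end = len(mm)
        body = mm[body_start:body_end]
        return body.replace(b"\r", b"").replace(b"\n", b"").decode("ascii")


# Tiny demonstration file; the contents are made up.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seq1 first\nACGT\nACGT\n>seq2 second\nTTTT\n")
    demo_path = tmp.name

seq1 = fetch_record(demo_path, "seq1")
seq2 = fetch_record(demo_path, "seq2")
missing = fetch_record(demo_path, "seq3")
os.remove(demo_path)
```

The operating system pages in only the parts of the file that are touched, which is where the speed advantage over a database round trip would come from.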

    It is very exciting that Bioinformatics has high visibility now, and many people with CS background are considering bioinformatics problems. Unfortunately, many of the introductory books on bioinformatics (particularly the O'Reilly books) do not adequately present the substantial foundations of bioinformatics that have been built over the past 15 - 20 years, and some newcomers are misled into believing there are simple problems looking for a few good programmers. Most of the simple problems have been solved; many of the complicated problems are challenging not because we do not know enough CS, but because we do not know enough biology.
    • Mod this up!! by ChaosMt (Score:1) Tuesday January 29 2002, @01:33PM
  • Since I'm a Lisp fiend: while we're on the subject of programming for bioinformatics, I'd like to point out that Allegro Common Lisp [franz.com] has been used by a few folks in the field. Here are two links:

    Pangea Systems Inc. (now DoubleTwist) for EcoCyc [franz.com].

    MDL Information Systems to design new drugs [franz.com].

  • PubMed Books online (Score:2, Informative)

    by NullSpaceKid (50403) on Tuesday January 29 2002, @01:19PM (#2920407)
    A selection of possibly relevant books (_Introduction to Genetic Analysis_, _Molecular Cell Biology_, etc) can be found at: www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books [nih.gov] NSK
  • by Anonymous Coward on Tuesday January 29 2002, @03:36PM (#2921328)
    It seems that perl is still being used purely because many bioinformatics departments are full of people who know how to program in perl. And this is because bioinformatics *used* to be pretty much only about string manipulation.
    This is just not true any more - proteomics requires in silico trypsin digest and algorithms for protein identification for MALDI mass spec (prediction of protein sequence via analysis of digested protein fragments); microarray experiments require cluster analysis of expression data in order to identify functional relationships. Added to this there are lots of issues relating to integrating the many many databases there are out there.
    The systems are becoming bigger and have to deal with lots of other systems around the world. Is Perl the best language for all this? I don't know but languages shouldn't be pushed into unsuitable roles purely for historical reasons and lots of bioinformaticians are trying to do this by trying to cling onto perl.

    martin
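For readers wondering what an "in silico trypsin digest" involves, here is a minimal sketch in Python. It uses the common simplified rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P); missed cleavages and other exceptions are ignored. The function name and the sample peptide are invented for illustration.

```python
import re


def trypsin_digest(protein):
    """Split a one-letter protein sequence into tryptic peptides.

    Simplified rule: cut after K or R, but not when followed by P.
    Splitting on a zero-width pattern needs Python 3.7 or later.
    """
    cut_site = r"(?<=[KR])(?!P)"
    return [peptide for peptide in re.split(cut_site, protein.upper()) if peptide]


fragments = trypsin_digest("MKRPLAKGGR")
print(fragments)
```

The fragment masses computed from such a digest are what get compared against the mass spectrum, and that step is numeric work rather than string manipulation.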
  • by cookie_cutter (533841) on Tuesday January 29 2002, @09:26PM (#2922974)
    The type of bioinformatics described in this book deals with processing long strings of symbols, which is how much biological sequence data is represented (e.g. DNA, RNA and protein sequence data).

    There is another area of bioinformatics which uses physics based simulations of biological systems. These types of tasks have little to do with ascii file processing, and are more sheer number crunching, and involve classic simulation modelling techniques.

    Some examples of these types of bioinformatics problems are:
    -simulation of protein folding
    -simulation of chemical reaction circuits/control mechanisms in a cell or organ system
    -cellular automata simulation of a group of cells in a tissue

    Because of the number crunching requirements involved, these types of tasks are usually coded in languages which are good at math and have fast compilers, such as fortran and C.

    I'm just trying to mention what else is out there, so that people don't get the idea that pattern parsing is the only thing bioinformaticists do.

  • Re:out of touch (Score:2)

    by pclminion (145572) on Tuesday January 29 2002, @11:26AM (#2919725)
    The problem with doing OO is it requires you to understand software engineering. Biologists are probably more interested in crunching their numbers than in good OO design. Or am I wrong?

    If biologists are learning how to do OO, maybe I should get out my old chemistry set and try some gene-splicing :)

    What's with that last couple of sentences? Did you fall out of bed this morning?
