
Comment Re:Age vs experience... (Score 1) 233

Since this thread is about hiring talented developers for fun projects, I'll throw this out (using the criteria from the parent):

Interesting: genome sequencing (an actual "big data" problem that's not just about selling stuff to people more effectively)
New location: Austin, TX

http://www.lab7.io/jobs/

-Chris

Comment Re:Bioinformatics Bubble? (Score 4, Interesting) 38

I run a bioinformatics software company, have been in the field for over a decade, and have worked in scientific computing even longer.

I'll start with a quick answer to the bubble question: there are already too many 'bioinformatics' grads but there are not enough bioinformatics professionals (and probably never will be). There are many bioinformatics Masters programs out there that spend two years exposing students to bioinformatics toolsets and give them cursory introductions to biology, computer science, and statistics. These students graduate with trade skills that have a short shelf life and lack the proper foundations to gain new skills. In that respect, there's a bubble, unfortunately.

If you're serious about getting into bioinformatics, there are a few good routes to take, all of which will provide you with a solid foundation to have a productive career.

The first thing to decide is what type of career you want. Three common career paths are Researcher, Analyst, and Engineer. The foundational fields for all are Biology, Computer Science (all inclusive through software engineering), and Statistics. Which career path you follow determines the mix...

Researchers have Ph.D.s and tend to pursue academic or government lab careers. Many research paths do lead to industry jobs, but these tend to morph into the analyst or engineer roles (much to the dismay of the researcher, usually). Bioinformatics researchers tend to have Ph.D.s in Biology, Computer Science, Physics, Math, or Statistics. Pursuing a Ph.D. in any of these areas and focusing your research on biologically relevant problems is a good starting point for a research career. However, there are currently more Ph.D.s produced than research jobs available, so after years in school, many bioinformatics-oriented Ph.D.s end up in analysis or engineering jobs. Your day job here is mostly grant writing and running a research lab.

Bioinformatics Analysts (not really a standard term, but a useful distinction) focus on analyzing data for scientists or performing their own analyses. A strong background in statistics is essential (and, unfortunately, often missing) for this role along with a good understanding of biology. Lab skills are not essential here, though familiarity with experimental protocols is. A good way to train for this career path is to get an undergraduate degree in Math, Stats, or Physics. This provides the math background required to excel as an analyst along with exposure to 'hard science'. Along the way, look for courses and research opportunities that involve bioinformatics, or even double major in Biology. Basic software skills are also needed, as most tools are Linux-based command line applications. Your day job here is working on teams to answer key questions from experiments.

Bioinformatics engineers/developers (again, not really a standard term, but bear with me) write the software tools used by analysts and researchers and may perform research themselves. A deep understanding of algorithms and data structures, software engineering, and high performance computing is required to really excel in this field, though good programming skills and a desire to learn the science are enough to get started. The best education for this path is a Computer Science degree with a focus on bioinformatics and scientific computing (many problems that are starting to emerge in bioinformatics have good solutions from other scientific disciplines). Again, aligning additional coursework and undergraduate research with biologists is key to building a foundation. A double major in Biology would be useful, too. To fully round this out, adding a Masters in Statistics would make you a great candidate, as long as your side projects were all biology related. Your day job here is building the tools and infrastructure to make bioinformatics function.

All three career paths can be rewarding and appeal to different mindsets.

If you haven't followed the NPR series on gene sequencing over the last few weeks, it's definitely worth listening to. I also did a talk a few years back at TEDxAustin on the topic that makes the connection between big data and sequencing ( http://bit.ly/mueller-tedxaustin ). Affordable sequencing is changing biology dramatically. Going forward, it will be hard to practice some parts of biology without sequencing, and sequencing needs informatics to function.

Good luck!

-Chris

Comment Agile Manifesto (Score 2) 491

In reading the comments, it's clear that many people don't know the roots of agile software development. In short, agile (note the lower case) development is basically a set of principles laid out by a group of very talented developers in 2001 in their agile manifesto:

http://agilemanifesto.org/

Note that the manifesto makes no mention of Extreme Programming, Scrum, or any of the other capital-A Agile methods. Instead, it focuses on observations about what made their software projects successful. It specifically doesn't prescribe any methodology, but rather encourages communication, iteration, and excellence in design and engineering. The last two points come from this section of the manifesto:

"Continuous attention to technical excellence
and good design enhances agility."

The manifesto very much allows for, and even encourages, design. It also assumes that the practitioners are already experienced developers who know how to design software and know how much design is needed before coding. Unfortunately, most Agile methods traded experience for certified training, and the 'technical excellence' portion was lost.

I've worked with many talented teams and have seen agile work time and again. Of course, all of those projects did have design, documentation, and tests. But, all those artifacts were developed using the same principles in the manifesto.

-Chris

Comment iPhone + BT keyboard + HDMI/DVI adapter (Score 2) 339

I've been using that combo more often for conferences and business meetings. If you want more screen real estate, an iPad or Galaxy tablet would work.

I like the iPhone approach since it limits me to a single device for everything (except coding). Keynote works great for presenting (I usually author in PowerPoint).

-Chris

Comment Re:This is what I like about Microsoft (Score 5, Insightful) 118

The big difference is that Microsoft Research is one of the last large corporate research labs focused on pure research. That is, research done for the sake of the research, not to drive product development. Research done at MSR doesn't have to be product driven (it has to be in the general space of software and computers, but that's about the only requirement). MSR is well funded by Microsoft and an integral part of the company's culture.

Sure, IBM, HP, and Intel all have research labs, but their charters have been re-written over the last ten years to focus more on product-centric research. Most research projects at these companies must start with a business plan that shows how the work will be commercialized within 5 years before being approved. This is not the pure research these labs were once known for.

Google, Facebook, Yahoo, and many other internet companies have some interesting projects (self driving cars, for instance), but these tend to be one-off projects and aren't part of a larger, long lived research organization.

Another interesting aspect of MSR is that they encourage all MS developers to take a stint in the organization, not just specially recruited Ph.D.s. It's not uncommon for someone to go from working on a product for a few years, take some time in MSR, then go back to product work.

I've worked directly with many of the research groups mentioned in this post over the last 20 years. Based on my experiences, MSR is truly the last real corporate research group (in the spirit of 20th century PARC/Watson/et al). The others are just part of the product funnels or whims of the founders.

-Chris

Submission + - Just what is 'Big Data'?

rockmuelle writes: I work in a 'Big Data' space (genome sequencing) and routinely operate on tera-scale data sets in a high-performance computing environment (high-memory (64-200GB) nodes, 10 GigE/IB networks, peta-scale high-performance storage systems). However, the more people I chat with professionally on the topic, the more I realize everyone has a different definition of what constitutes big data and what the best solutions for working with large data are. If you term yourself a 'big data' user, what do you consider 'big data'? Do you measure data in mega-, giga-, tera-, or petabytes? What is a typical data set you work with? What are the main algorithms you use for analysis? What turn-around times are typical for analyses? What infrastructure software do you use? What system architectures work best for your problem (and which have you tried that don't work well)?

Comment Work with your company's legal team (Score 1) 467

I work under a similar, very restrictive IP agreement. I raised the issue of side projects with the corporate lawyer in charge of IP and explained the types of projects I do on the side for fun and profit. While the company does not grant blanket exclusions, they were happy to review them on a project by project basis and grant exceptions.

Their goal was to protect the company's business using standard legal tools. Just like my job requires me to use my skills to the fullest, so does theirs. However, talking through it made it clear that there was no malicious intent.

One important thing to know when doing this: the lawyers represent the company and are ethically bound to put the company's interests first. They won't be able to give you any legal advice. You may want to talk to a lawyer first, just so you have outside counsel.

Also, this is just business for the company. The more you treat it as business (and not good vs. evil), the better chance you'll have of success.

-Chris

Comment TEDx Talk on the Subject (Score 3, Informative) 239

I did a talk on this a few years back at TEDx Austin (shameless self promotion): http://www.youtube.com/watch?v=8C-8j4Zhxlc

I still deal with this on a daily basis and it's a real challenge. Next-generation sequencing instruments are amazing tools and are truly transforming biology. However, the basic science of genomics will always be data intensive. Sequencing depth (the amount of data that needs to be collected) is driven primarily by the fact that genomes are large (E. coli has around 5 M bases in its genome, humans have around 3 billion) and biology is noisy. Genomes must be over-sampled to produce useful results. For example, detecting variants in a genome requires 15-30x coverage. For a human, this equates to 45-90 Gbases of raw sequence data, which is roughly 45-90 GB of stored data for a single experiment.
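
The arithmetic above is simple enough to sketch (illustrative Python; the one-byte-per-base figure is a rough approximation for ASCII formats, not an exact on-disk size):

```python
# Back-of-the-envelope storage estimate for a sequencing experiment.
# Assumes ~1 byte per base on disk, a rough figure for ASCII formats.

def raw_bases(genome_size, coverage):
    """Total bases of raw sequence needed at a given over-sampling depth."""
    return genome_size * coverage

HUMAN_GENOME = 3_000_000_000  # ~3 billion bases

low = raw_bases(HUMAN_GENOME, 15)   # 15x coverage
high = raw_bases(HUMAN_GENOME, 30)  # 30x coverage

print(low / 1e9, "-", high / 1e9, "Gbases")  # 45.0 - 90.0 Gbases
```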

The two solutions mentioned most often in this thread, compression and clouds, are promising but not yet practical in all situations. Compression helps save on storage, but almost every tool works on ASCII data, so there's always a time penalty when accessing the data. The formats of record for genomic sequences are also all ASCII (fasta, and more recently fastq), so it will be a while, if ever, before binary formats become standard.
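
For reference, a fastq record is four ASCII lines per read, which is why the data is bulky but simple to process. A minimal parse (the record contents here are hypothetical):

```python
# One FASTQ record: four ASCII text lines per read (id, sequence,
# separator, per-base quality scores). Everything is plain text, which
# is why it compresses well but tools pay a penalty re-reading ASCII.

record = "@read1\nGATTACA\n+\nIIIIIII\n"

def parse_fastq_record(text):
    """Split a single four-line FASTQ record into (id, sequence, quality)."""
    header, seq, _, qual = text.strip().split("\n")
    return header[1:], seq, qual  # drop the leading '@' from the id

print(parse_fastq_record(record))  # ('read1', 'GATTACA', 'IIIIIII')
```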

The grid/cloud is a promising future solution, but there are still some barriers. Moving a few hundred gigs of data to the cloud is non-trivial over most networks (yes, those lucky enough to have Internet2 connections can do it better, assuming the bio building has a line running to it) and, despite the marketing hype, Amazon does not like it when you send disks. It's also cheaper to host your own hardware if you're generating tens or hundreds of terabytes. 40 TB on Amazon costs roughly $80k a year whereas 40 TB on an HPC storage system is roughly $60k total (assuming you're buying 200+ TB, which is not uncommon). Even adding an admin and using 3 years' depreciation, it's cheaper to have your own storage. The compute needs are rather modest as most sequencing applications are I/O bound - a few high memory (64 GB) nodes are all that's usually needed.
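
The cost comparison works out roughly like this (the per-TB figures are the ones quoted above and will drift over time; the admin cost is a placeholder assumption):

```python
# Rough cloud-vs-owned storage comparison over a 3-year window.
# CLOUD_PER_TB_YEAR and OWNED_TOTAL are the figures from the comment;
# ADMIN_PER_YEAR is an assumed incremental cost for a partial sysadmin.

TB = 40
CLOUD_PER_TB_YEAR = 2_000   # ~$80k/yr for 40 TB on Amazon
OWNED_TOTAL = 60_000        # one-time cost for 40 TB of HPC storage
ADMIN_PER_YEAR = 30_000     # placeholder assumption

years = 3
cloud = TB * CLOUD_PER_TB_YEAR * years
owned = OWNED_TOTAL + ADMIN_PER_YEAR * years

print(f"cloud: ${cloud:,}  owned: ${owned:,}")  # cloud: $240,000  owned: $150,000
```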

Keep in mind, too, that we're asking biologists to do this. Many biologists got into biology because they didn't like math and computers. Prior to next-generation sequencing, most biological computation happened in calculators and lab notebooks.

Needless to say, this is a very fun time to be a computer scientist working in the field.

-Chris

Comment Queasy... (Score 1) 332

I get queasy just thinking of coding on a ship for an hour, let alone a few months or years. Maybe the Caribbean or the Gulf of Mexico might work, but anywhere at sea subject to swells that have had thousands of miles to mature can't be that conducive to coding. And, if you can tolerate it, you'll make more money on an oil rig.

-Chris

Comment Re:Possible use... (Score 2) 412

That's more likely the slice of images taken from the satellite's path. I suspect the satellite imaged that region when the channels/roads/whatever had a layer of water on top and were reflecting the sunlight. If you look at the adjoining tiles, there's still a channel structure, it's just not reflective.

-Chris

Comment Phidgets (Score 2) 147

http://www.phidgets.com/

I've used Phidgets in the past for exactly this application (research into UIs for large data). Lots of premade USB controls are available and it's easy to hook up most analog controls to their I/O boards. I went to the local electronics shop and bought a slew of buttons, knobs, and sliders and had no problem hooking them up with Phidgets.

For programming, I wrapped the C library in Python using SWIG.
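
SWIG generates the wrapper from the C headers; for simple cases the stdlib ctypes module does the same job without a build step. A minimal illustration against libc rather than the Phidgets library (POSIX systems; on Windows the C runtime is located differently):

```python
# Minimal example of calling a C library from Python with ctypes.
# Uses libc's abs() for illustration; wrapping something like the
# Phidgets C library follows the same pattern: load the shared
# library, declare argtypes/restype, then call.

import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

print(libc.abs(-5))  # 5
```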

-Chris

Comment HPC Applications (Score 1) 314

A number of HPC applications funded by NSF/DARPA/DOE grants provide a continuing source of new research while the applications themselves are maintained and improved.

One example is OpenMPI. BLAS/LINPACK/LAPACK are also examples. Some of the C++/Boost libraries are also maintained in academia, such as the Boost Graph Library.

-Chris

Comment DNA, RNA, and Genes (Score 5, Informative) 173

The judge's reasoning in the ruling hinges on the fact that the BRCA1/2 genes do not appear in nature as isolated, unmodified DNA and instead only appear in DNA form as part of a (much) larger chromosome. While technically true, it ignores an important fact of genomics: while the BRCA genes do not appear in vivo as isolated _DNA_, they do appear as isolated _RNA_. The RNA counterpart of the DNA sequence is slightly modified - it is the 'reverse-complement' of the DNA with the T's replaced with U's (for example, AACC - (reverse complement) -> GGTT - (sub U for T) -> GGUU).
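
That transformation is easy to sketch in a few lines of Python:

```python
# DNA -> RNA as described above: reverse-complement the DNA sequence,
# then substitute U for T.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement each base (A<->T, C<->G), then reverse the sequence."""
    return dna.translate(COMPLEMENT)[::-1]

def dna_to_rna(dna):
    """Reverse-complement, then swap T for U."""
    return reverse_complement(dna).replace("T", "U")

print(reverse_complement("AACC"))  # GGTT
print(dna_to_rna("AACC"))          # GGUU
```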

So, in a very perverse way, the judge is correct. The isolated, unmodified DNA does not appear in nature.

There is a natural mechanism for converting RNA back into DNA called reverse transcription (RT). RT-based methods are how we sequence genes. RNA from genes is isolated and converted back into DNA for sequencing. This is a standard lab method and is used for all gene sequencing. (Interestingly, if someone were to find RT at work in a cell converting BRCA genes back to DNA, the patent could be invalidated.)

The gene itself, in RNA form, appears isolated in nature. The RNA sequence cannot be patented. But, sequencing methods all rely on converting RNA back to DNA for sequencing. The sequence is read as DNA. But, that's not really the gene, that's just a modified representation of the gene. The functioning gene is the RNA version, not the DNA copy of it.

What's frustrating is that Myriad is using a technical aspect of how gene/RNA sequencing works to claim a patent on a gene itself.

-Chris

Comment Re:Depends... (Score 1) 244

"BTW, a conference publication isn't considered a "journal" publication, and doesn't confer the same status. Conferences are where the work gets done: people present developing ideas and get feedback on them."

This is true for all fields except computer science. In CS, conference publications form the basis of a publication record and journals tend to be more for 'archival' bodies of work. CS conferences are peer reviewed to the same standards as most journals in other fields.

CS is lacking journals with quick turnaround times and journals that accept incremental work. Articles submitted to CS journals often go through a year or more of reviews and revisions whereas the entire cycle for a conference from submission to publication is typically around 8 months. In CS, incremental work, which forms the bulk of most publishing records, is communicated through conferences, not journals. There is no equivalent of Physical Review Letters in CS.

Of course, this presents challenges and opportunities for CS academics. The main challenge is that few tenure committees (not departmental, but once the tenure case is presented to the larger committee) understand this, and it always takes some explanation when making the case for a strong CS professor whose only pubs are at conferences such as POPL or SIGGRAPH (for non-CS people, publishing in those venues is right up there with Science or Nature in terms of importance). The opportunity for clever tenure-track computer scientists is to partner with a physicist or biologist and rack up a number of co-authorships on papers in incremental journals. Oddly, CS people tend to think that _all_ journals are archival and difficult to publish in, so pubs in journals with collaborators help your CS case.

It's a silly world...

-Chris
