Data scientists are this bubble's webmasters. 'Nuff said.
For the sake of this discussion, mosquito-borne malaria is a warm-weather problem. Increased deaths from cold weather, which was the parent's straw man, occur when it's really cold (sub-freezing). Mosquitoes die when it freezes. Sure, they can be a problem even when it's cold, but not when it's deadly cold.
How about just remove the camera? That's the creepiest part of Google Glass.
I'm all for exploring the potential of having a display in my line of sight for getting information on demand or for AR applications. You don't need a camera for either of those. For AR, the GPS in the phone gives you position, accelerometers in the headset give you orientation, and public databases of roads and buildings give the apps spatial awareness. If you want to be able to highlight people or cars, they could 'opt in' to a location-sharing feature that publishes their coordinates.
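The geometry for camera-free AR is simple enough to sketch. A back-of-the-napkin version in Python (the coordinates, field of view, and function names are all made up for illustration):

    import math

    # Hypothetical sketch: place an AR marker without any camera.
    # GPS (phone) gives position, headset sensors give heading, and a
    # public map database gives the target building's coordinates.

    def bearing_to(lat1, lon1, lat2, lon2):
        # Initial compass bearing, in degrees, from point 1 to point 2.
        phi1, phi2 = math.radians(lat1), math.radians(lat2)
        dlon = math.radians(lon2 - lon1)
        y = math.sin(dlon) * math.cos(phi2)
        x = (math.cos(phi1) * math.sin(phi2)
             - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
        return math.degrees(math.atan2(y, x)) % 360.0

    def screen_offset(heading_deg, bearing_deg, fov_deg=30.0):
        # Marker position as a fraction of display width (-0.5..0.5),
        # or None when the target is outside the field of view.
        delta = (bearing_deg - heading_deg + 180.0) % 360.0 - 180.0
        return delta / fov_deg if abs(delta) <= fov_deg / 2 else None

    # Made-up coordinates: user in one spot, building in another.
    bearing = bearing_to(47.6097, -122.3331, 47.6205, -122.3493)
    print(screen_offset(heading_deg=315.0, bearing_deg=bearing))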
Battery life would probably be much better w/o the camera as well.
31,000 extra deaths due to cold weather and the flu in 2013.
584,000 deaths due to malaria in the same year.
Malaria is transmitted by mosquitoes, which rely on warm weather to live. And that's just one warm-weather-related cause of death that will go up as the planet warms.
Look, I'm a huge Python and open source advocate and use it for almost everything I do, but the "proprietary" argument doesn't hold any water. VB, and Microsoft's languages in general, have seen more long-term support than any open source language. They have consistently had a level of commitment to backwards compatibility and long-term support that no open source language implementation can match. Sure, with an open source language you can fix problems yourself*, but if there's good support from the vendor, as is the case with MS, you never need to.
You're going to need to give a much better reason than "proprietary" to discount the VB argument. There are lots of good ones, but this isn't one.
*though I'd argue that there are only a few of us out there with the chops to actually do that
We develop complex scientific software and made the decision to go HTML/JS for all our client code a few years ago and haven't regretted it. It takes a little bit of learning the libraries*, but there are some good, mature ones available to streamline development.
*I've used Django and Tornado+SQLAlchemy extensively for this.
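For what it's worth, the server side of that stack can stay very thin. A stripped-down Tornado handler in the spirit of what we run (the endpoint and payload here are invented; real code would query the database through SQLAlchemy):

    import json
    import tornado.ioloop
    import tornado.web

    class RunHandler(tornado.web.RequestHandler):
        def get(self, run_id):
            # A canned payload keeps the sketch self-contained; swap in
            # a SQLAlchemy query for real data.
            self.set_header("Content-Type", "application/json")
            self.write(json.dumps({"run": run_id, "status": "complete"}))

    if __name__ == "__main__":
        app = tornado.web.Application([(r"/api/runs/([0-9]+)", RunHandler)])
        app.listen(8888)
        tornado.ioloop.IOLoop.current().start()

The HTML/JS client just fetches JSON from endpoints like this and does all the rendering locally.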
MPI is definitely for very specific problems and really isn't what I'd consider "conventional" cluster programming. Most people associate MPI with clusters and parallel computing, but if you look at what's actually running on most big clusters, it's almost always just batch jobs (or batch jobs implemented using MPI).
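To make that concrete, here's the shape most of that "MPI" code actually takes, sketched with mpi4py (the file names and per-file work are stand-ins): every rank grabs its own slice of the inputs and never exchanges a single message.

    from mpi4py import MPI

    # MPI reduced to a job launcher: no sends, no receives, no collectives.
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    inputs = ["chunk_%03d.dat" % i for i in range(100)]  # made-up file list

    for path in inputs[rank::size]:  # round-robin split across ranks
        print("rank %d processing %s" % (rank, path))  # stand-in for real work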
Interestingly, all my examples were on genomics problems (processing SOLiD and Ion Torrent runs). We started going down the Hadoop path because we thought it'd be more accessible to the bioinformaticians. But once we saw the performance differences (and, importantly, understood the source of them), we abandoned it pretty quickly for more appropriate hardware designs (fast disks, fat pipes, lots of RAM, and a few Linux tuning tricks -- swappiness=0 is your friend). Incidentally, GATK suffered from these same core performance problems. The original claims that the map-reduce framework would make GATK fast were never actually tested, just asserted in the paper. GATK's performance has always been orders of magnitude worse than the same algorithms implemented without map-reduce. But it's from the Broad, so it must be perfect.
I like Sector and Sphere. We also did a POC with them, and they performed much better than the alternatives. Unfortunately, they also required very good programmers to use effectively.
More importantly, why did we need Hadoop when we already had [your_favorite_language] + [your_favorite_job_scheduler] + [your_favorite_parallel_file_system]?
Seriously, standard HPC batch processing methods are always faster and easier to develop for than latest_trendy_distributed_framework.
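On a single fat node, the whole pattern is a dozen lines in any scripting language; a scheduler like SGE or SLURM spreads the same idea across nodes. A hedged Python sketch (the input layout and the per-file work are placeholders):

    from glob import glob
    from multiprocessing import Pool

    def process_file(path):
        # Placeholder per-file work: count lines containing a motif.
        with open(path) as fh:
            return path, sum("GATTACA" in line for line in fh)

    if __name__ == "__main__":
        inputs = sorted(glob("data/*.txt"))  # assumed input layout
        with Pool() as pool:  # one worker per core, no framework
            for name, hits in pool.map(process_file, inputs):
                print(name, hits)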
The challenges of data at scale* are almost all related to IO performance and the overhead of accessing individual records.
IO performance is solved by understanding your memory hierarchy and designing your hardware and tuning your file system around your common access patterns. A good multi-processor box with a fast hardware raid and decent disks and sufficient RAM will outperform a cheap cluster any day of the week and likely cost less (it's 2015, things have improved since the days of Beowulf). If you need to scale, a small cluster with Infiniband (or 10 GigE) interconnects and Lustre (or GPFS if you have deep pockets) will scale to support a few petabytes of data at 3-4 GB/s throughput (yes, bytes, not bits). You'd be surprised what the right 4 node cluster can accomplish.
On the data access side, once the hardware is in place, record access times are improved by minimizing the abstraction penalty for accessing individual records. As an example, accessing a single record in Hadoop generates a call stack of over 20 methods from the framework alone. That's a constant multiplier of 20x on _every_ data access**. A simple Python/Perl/JS/Ruby script reading records from the disk has a much smaller call stack and no framework overhead. I've done experiments on many MapReduce "algorithms" and always find that removing the overhead of Hadoop (using the same hardware/file system) improves performance by 15-20x (yes, that's 'x', not '%'). Not surprisingly, the non-Hadoop code is also easier to understand and maintain.
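For comparison, here's the entire "framework" on the script side -- the canonical word-count-style aggregation, assuming tab-delimited records:

    import sys

    # Stream records straight off the disk and aggregate by key. Per
    # record, the call stack is the file iterator plus split() -- no
    # serialization layer, no RPC, no 20-deep framework stack.
    counts = {}
    with open(sys.argv[1]) as fh:
        for line in fh:
            key = line.split("\t", 1)[0]
            counts[key] = counts.get(key, 0) + 1

    for key in sorted(counts):
        print(key, counts[key])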
tl;dr: Pick the right hardware, understand your data access patterns, and you don't need complex frameworks.
Next week: why databases, when used correctly, are also much better solutions for big data than latest_trendy_framework.
*also: very few people really have data that's big enough to warrant a distributed solution, but let's pretend everyone's data is huge and requires a cluster.
** it also assumes the data was on the local disk and not delivered over the network, at which point, all performance bets are off.
For small genomes, yes, but for large genomes, there is a lot of "unused" material.
Only about 6-10% of the human genome is transcribed into RNA, either the protein-coding kind or the non-coding types used in regulation. (Small genomes are almost always entirely coding and even include overlapping coding regions; large genomes are the ones that have "junk" DNA in them.)
Transcription is most closely related to a processor reading machine code and doing something with it. In a computer program, we know that we can safely remove dead code paths and the code will still function. This is not true for DNA. Remove a portion of someone's genome and they usually die.
It's much more likely that the "junk"/"noise" regions of the genome are structural and help the DNA take on the right conformation so the chromosomes can specialize for different functions. DNA folds differently depending on the cell type in multicellular organisms. Because the nucleus of a cell is a fairly crowded place, the way the DNA folds determines which sites on it are even accessible for transcription. Muscle cells expose one set of gene-coding regions, fat cells expose another.
Taken from this perspective, large genomes are more akin to an origami fortune teller than machine code. Depending on the series of folding/unfolding events, a specific fortune is revealed. The fortunes are encoded directly onto the paper, but the paper also forms the structure used to access the fortunes. Another actor reads the instructions and acts on them (a person in the origami case or polymerase for DNA).
I build similar systems for genomics. Despite all the hype around cloud services in our space, we're finding more interest in local copies of the standard databases, with links out to the canonical sources as needed. The local copies keep hospital IT happy and ensure access if the network is wonky.
And, it turns out that most clinicians are comfortable sorting through database records on their own and don't like magic algorithms attempting to do it for them. Access to the basic data is what they want.
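The pattern itself is trivial. A hypothetical sketch (the schema, IDs, and canonical URL are all invented): serve from the local mirror when possible, and hand back a link to the canonical source when the record isn't mirrored.

    import sqlite3

    CANONICAL_URL = "https://example.org/variants/%s"  # invented URL

    def lookup(conn, rsid):
        row = conn.execute(
            "SELECT rsid, gene, significance FROM variants WHERE rsid = ?",
            (rsid,),
        ).fetchone()
        if row is not None:
            return {"rsid": row[0], "gene": row[1], "significance": row[2]}
        # Not mirrored: return the canonical link instead of failing,
        # so the record is still reachable when the network cooperates.
        return {"rsid": rsid, "see": CANONICAL_URL % rsid}

    # In-memory stand-in for the local mirror file, with made-up data.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE variants (rsid TEXT, gene TEXT, significance TEXT)")
    conn.execute("INSERT INTO variants VALUES ('rs123', 'GENE1', 'benign')")
    print(lookup(conn, "rs123"))  # served locally
    print(lookup(conn, "rs999"))  # falls back to the canonical link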
Those pictures are amazing! I was immediately taken back to playing Super Mario Galaxy and imagined Mario running around the comet.
The first few weeks of any training program typically suck. That's where willpower (or encouragement if you're in a group) plays such an important role.
Once I'm past the initial hump, I always feel the "addictive" need to get more exercise and chase the high. In my specific case, the "high" comes after sustained exertion in the medium/high effort range. I rarely see it biking (I'm a bike commuter and never really push myself). But running, climbing, mountaineering, and snowboarding all bring it out. For running, on long runs at a moderate pace it kicks in around mile 5 or 6. For short, faster runs, it kicks in about 30 minutes after the run and lasts for a few hours. Other sports have similar patterns. In my experience, the feeling is most similar to hydrocodone (which, unfortunately, I also know about from running).
Wikipedia's description of the "runner's high" covers some of the suspected mechanisms for it.
A big reason we accept the trade-offs of modern processors is that it's generally easy to program a broad range of applications for them.
In the mid-aughts (not very long ago, actually), there was a big push for heterogeneous multi-core processors and systems in the HPC space. Roadrunner at Los Alamos was a culmination of this effort (one of the first petascale systems). It was a mix of processor types, including IBM's Cell (itself a heterogeneous chip). Programming Roadrunner was a bitch. With different processor families in one system, you had to decompose your algorithm to target the right processor for each task. Then you had to worry about moving data efficiently between the different processors.
This type of development is fun as an intellectual exercise, but very difficult and time consuming in practice. It's also something compilers will never be good at, requiring experts in the architectures, domains, and applications to effectively use the system.
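A toy illustration of the burden (plain Python stubs, nothing Cell-specific): the programmer, not the compiler, decides which stage runs on which device and schedules every transfer by hand.

    # Stand-in "devices": in real heterogeneous code these are different
    # ISAs with separate toolchains and separate memory spaces.

    def host_stage(chunk):
        return [x * x for x in chunk]   # branchy/scalar work -> host CPU

    def accel_stage(chunk):
        return [x + 1 for x in chunk]   # streaming math -> accelerator

    def copy_to_accel(chunk):
        return list(chunk)              # explicit transfer, paid per chunk

    def pipeline(data, chunk_size=4):
        out = []
        for i in range(0, len(data), chunk_size):
            staged = copy_to_accel(host_stage(data[i:i + chunk_size]))
            out.extend(accel_stage(staged))  # every hop scheduled by hand
        return out

    print(pipeline(list(range(10))))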
Another lesson from the period (and one that anyone who's done ASICs has known for years) is that general-purpose hardware usually evolves fast enough to catch up with specialized hardware within a reasonable timeframe (usually 6-18 months; see D. E. Shaw's protein-folding ASIC as an example).
While custom processors are cool (I love hacking on them), they're rarely practical.
So, there is some truth to the 3-month number. I learned C from a minimal programming background in one semester as an undergrad, or about three months. Of course, 20 years later I'm still refining my skills. The rest of the CS degree gave me a much more solid foundation than I'd have if I had gone straight to work after learning C. Surprisingly, basic theory like complexity analysis comes in handy when building applications.
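A concrete everyday example of that theory paying off: two "correct" de-duplication routines, one of which quietly blows up as the data grows.

    def dedupe_quadratic(items):
        seen, out = [], []
        for x in items:
            if x not in seen:  # list membership is O(n) -> O(n^2) overall
                seen.append(x)
                out.append(x)
        return out

    def dedupe_linear(items):
        seen, out = set(), []
        for x in items:
            if x not in seen:  # set membership is O(1) avg -> O(n) overall
                seen.add(x)
                out.append(x)
        return out

Both pass the same tests; only the analysis tells you which one survives a million records.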
Open Standards and Protocols are what this space needs, along with regulations requiring vendors to allow interoperability for free or a nominal fee.
Open Source software, on the other hand, won't really solve any problems. Someone has to write the software and vet it. EHR software isn't an itch people typically want to scratch. Of course, an EHR platform could leverage Open Source software for development. A Web-based EHR could use an entire Open Source stack and even contribute libraries for protocol support.
Open Source is great for infrastructure components, not so great for user-facing applications. At some level in the stack, someone needs to do the UX work, testing, and validation to create an application people can actually use.
I would never advocate for a fully Open Source solution for EHRs or any other complex, user-facing software, but I would put incentives in place to leverage as much Open Source in the stack as possible. Plus, any company that does that right will have much cheaper dev costs and will be able to undercut the competition a bit (though for supported software, development is usually only 10-20% of the cost, with support, marketing, sales, etc. taking up the bulk).