Interactive data analysis with terabytes of data

Interactive Big Data analytics using Cloud technologies

Biomedical Big Data and the research built on it are rapidly finding a home on public clouds. Part of this comes from the maturation of cloud technologies, and part comes from the desire to play well in a research community. Cloud democratizes access to affordable tools for the broader community. As little as ten years ago, large-scale computing required building dedicated data centers. Those data centers are still a lot more cost-effective than Cloud, but Cloud removes the need for upfront investment, making it possible to explore first.

This publication (originally posted on the bioRxiv preprint server) demonstrates interactive analytics with Google BigQuery on terabyte-scale genomic data. What makes this approach different? First, interactive analysis brings unprecedented power compared to batch-mode analysis. Data exploration is, by definition, iterative: new exploration strategies are often based on intuition gained from earlier ones. If you can reduce the cost sufficiently and make the exploration real time (or near real time), you explore faster and further.

"Google wants to store your genome" - MIT Tech Review Link

Figure from original submission to bioRxiv preprint server.

Due to the nature of Big Data, it may not be possible to move the data around unless you are affiliated with a university that has a fast network connection. So the authors demonstrate the use of a Cloud environment for a range of genomic data exploration tasks, all the way from variant calling to QA, GWAS, and machine learning methods.
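To make the interactive style a little more concrete, here is a minimal Python sketch of what a single exploration step against BigQuery might look like. It is not taken from the publication: the table name `my-project.genomics.variants` and the column `reference_name` are hypothetical placeholders, and the query (a per-chromosome variant count) is just one plausible first question to ask of a variant table. Running it for real assumes the `google-cloud-bigquery` package is installed and credentials are configured.

```python
import re

# Hypothetical table name, for illustration only (not from the publication).
DEFAULT_TABLE = "my-project.genomics.variants"

# Table names cannot be passed as query parameters, so validate before formatting.
_TABLE_RE = re.compile(r"^[A-Za-z0-9_\-]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_]+$")


def build_variant_count_query(table=DEFAULT_TABLE, limit=25):
    """Return SQL that counts variant rows per reference sequence.

    Assumes the table has a `reference_name` column (hypothetical schema).
    """
    if not _TABLE_RE.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT reference_name, COUNT(*) AS variant_count\n"
        f"FROM `{table}`\n"
        "GROUP BY reference_name\n"
        "ORDER BY variant_count DESC\n"
        f"LIMIT {limit}"
    )


def run_query(sql):
    """Run the SQL on BigQuery and return the rows as a list of dicts.

    Requires google-cloud-bigquery and valid credentials; the import is
    deferred so the query builder above can be used without either.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    return [dict(row.items()) for row in client.query(sql).result()]


if __name__ == "__main__":
    # Print the query only; call run_query(...) once a real table is available.
    print(build_variant_count_query())
```

The point of the sketch is the loop it enables: look at the counts, adjust the query (filter by region, join sample metadata, and so on), and rerun within seconds rather than waiting on a batch job.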

The following quote appeared in a Google Cloud blog post that summarized the adoption of Google Cloud at Stanford University.

“We’re entering an era where people are working with tens of thousands or even millions of genome projects, and you’re never going to easily do that on a local cluster. Cloud computing is where the field is going.”
— Mike Snyder, PhD, Director, Stanford Center for Genomics and Personalized Medicine