Building a Cancer Genomics Cloud Pilot
An interoperable cloud platform to meet the next level of challenges in cancer genomics research
We are delighted to announce that we were recently selected to participate in the Cancer Genomics Cloud Pilots project funded by the National Cancer Institute (NCI). The goal of this 24-month project is to democratize and accelerate cancer research by enabling all researchers to use cloud-computing technologies to interrogate petabyte-scale cancer genomics data.
This is the first in a series of blog posts. Over the coming months, we hope you will join us here as we delve deeper into the project timeline, the scale of the challenge, and the approaches the community is taking.
The Cancer Genomics Data Challenge:
Genomic data is generated both by individual research groups and by large consortia. As the cost of generating data has fallen, the quantity of genomic data available to researchers has grown exponentially [1]. Large-scale, integrative analysis of data from multiple sources promises to reveal new insights into human health and disease. However, the size of these datasets prevents most researchers from fully exploring them with currently available computational infrastructure.
The challenge of integrative analysis of large datasets is perhaps most obvious in the case of The Cancer Genome Atlas (TCGA). Through a national and collaborative effort, members of the TCGA consortium seek to generate comprehensive genomic maps of more than 30 different types of cancer. For each type of cancer, tumor and matched normal tissues from hundreds of individuals are analyzed across multiple dimensions [2]. The goal is to allow genomic changes to be correlated with patient outcomes and, in turn, improve treatment strategies.
The total data contained in TCGA is expected to reach 2.5 petabytes by the end of 2014. Simply downloading this dataset to local storage would take the average biomedical research facility more than seven months, even over a fully saturated connection [3]. Once downloaded, analyzing the dataset would still require computational resources available only to the few researchers with access to large institutional compute clusters. Clearly, bringing the data to local machines for analysis is untenable with datasets of this size.
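The download estimate above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes decimal units (1 petabyte = 10^15 bytes), a link that stays fully saturated, and no protocol overhead, none of which is stated in the post. The function name `transfer_days` is ours, chosen for illustration.

```python
# Back-of-the-envelope transfer time for the projected TCGA dataset.
# Assumptions (ours, not the post's): decimal units, a fully saturated
# link, and no protocol overhead, retries, or disk bottlenecks.

DATASET_PETABYTES = 2.5       # projected TCGA size by the end of 2014
LINK_GIGABITS_PER_SEC = 1.0   # institutional connection from footnote [3]


def transfer_days(petabytes: float, gigabits_per_sec: float) -> float:
    """Return the ideal transfer time in days for the given size and link speed."""
    total_bits = petabytes * 1e15 * 8           # bytes -> bits
    seconds = total_bits / (gigabits_per_sec * 1e9)
    return seconds / 86_400                     # seconds per day


days = transfer_days(DATASET_PETABYTES, LINK_GIGABITS_PER_SEC)
print(f"{days:.0f} days (~{days / 30:.1f} months)")
```

Under these idealized assumptions the transfer takes roughly 231 days. Real-world throughput is usually lower than the nominal link speed, so this is best read as a lower bound.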
Our approach:
The challenges described above can be largely addressed by co-localizing the data with the computational power to analyze it and by “bringing the apps to the data.” This is the fundamental promise of cloud computing, and Seven Bridges Genomics was founded in 2009 to deliver on it. Since then, we have developed a robust ecosystem that automates the provisioning of computational resources for storage and execution, supplying compute nodes and storage on demand and in bursts. This minimizes costs and gives researchers seamless access to the computational resources their analyses require.
Over the past five years, we have also focused on the difficult challenge of promoting reproducible bioinformatics research. With the advent of next-generation sequencing, the methodologies for generating genomic data have undergone a Cambrian explosion, and the computational methods for analyzing that data continue to multiply. Ensuring reproducibility has become all the more important as a result. Because different versions of a tool can produce different results, tracking the precise tool versions used in an analysis is critical to supporting reproducible research and, eventually, clinical application of these technologies. We version and store all executions so that results are always associated with a complete snapshot of the tool versions, parameters, and input files [4].
We have created a software development kit (SDK) that lets tool developers easily contribute new methods, which can then be released on the Cancer Cloud Pilot for anyone to use. The SDK is based on an open-source, community-supported initiative and uses Docker images to facilitate sharing of reproducible workflows on any platform.
Our plan for the cancer genomics cloud is to combine three core elements: (1) co-localized compute and storage infrastructure (i.e., the cloud); (2) a bioinformatics “operating system” that manages cloud resources, tools, pipelines, and data in a user-friendly way; and (3) the cancer genomics data that these systems operate on.
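To make the idea of an execution snapshot concrete, here is a minimal sketch of a provenance record that ties a result to its tool version, parameters, and input files. It is an illustration under our own assumptions, not a description of how the Seven Bridges platform stores executions: the `ExecutionRecord` class, its fields, and the `checksum` helper are all hypothetical names.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class ExecutionRecord:
    """Hypothetical snapshot of everything needed to re-run an analysis."""
    tool: str
    tool_version: str
    parameters: dict
    input_checksums: dict  # file name -> SHA-256 of its contents

    def fingerprint(self) -> str:
        # Serialize with sorted keys so the same record always
        # produces the same fingerprint.
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two runs on identical inputs and parameters, differing only in tool version.
reads = b"@read1\nACGT\n+\nIIII\n"
inputs = {"reads.fastq": checksum(reads)}
run_a = ExecutionRecord("aligner", "1.0.0", {"threads": 4}, inputs)
run_b = ExecutionRecord("aligner", "1.0.1", {"threads": 4}, inputs)

print(run_a.fingerprint() == run_b.fingerprint())  # False: tool version differs
```

Because the fingerprint covers the tool version as well as the parameters and input checksums, two results can be compared at a glance to see whether they came from the same analysis configuration.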
Join us:
As part of the Cancer Genomics Cloud Pilots project, we will build upon our cloud genomics expertise to create a public Cancer Cloud that can meet the next level of challenges in big-data cancer genomics. This is the community’s Cancer Genomics Cloud, supported by the NCI [5]. All development work for the project will be open source. With all this in mind, let’s get started! Please join the community and offer your input. We look forward to building the features most needed by cancer researchers.
[1] See figure. Raw data accessed from SRA and NHGRI, October 2014.
[2] Assays in TCGA include imaging; microsatellite instability analysis; DNA sequencing; total RNA, mRNA, and miRNA sequencing; protein expression analysis and proteomics; array-based expression analysis; DNA methylation analysis; and copy number variation analysis.
[3] Assuming a standard institutional Internet connection supporting transfer rates of 1 gigabit/second.
[4] In other words, the complete provenance of any result file is automatically captured. This allows re-execution of the analysis with the exact parameters and tool versions months or years after the initial analysis.
[5] This project has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN261201400008C.