Usable, Collaborative, Reproducible, and Extensible
Four Key Tenets of Cloud Computing
The Seven Bridges Cancer Genomics Cloud (CGC) is one of three pilot systems funded by the National Cancer Institute with the aim of co-localizing massive genomics datasets, such as The Cancer Genome Atlas (TCGA), with secure and scalable computational resources for analysis.
TCGA comprises multidimensional matched tumor-normal data from over 11,000 patients and 33 cancer types. One of our key aims when designing the CGC was to maximize the impact of TCGA and empower cancer genomics analysis.
With this goal in mind, we developed four guiding principles for building the CGC and other large genomics projects.
1. Making data available isn’t enough to make it usable
TCGA contains over a petabyte of data, which is made available to authorized researchers. However, as TCGA has grown, it has become more difficult to use and to learn from. Downloading the data for local analysis is time-consuming, and storage is expensive. By contrast, CGC users can immediately begin to explore and analyze TCGA data in the cloud. Even better, researchers don’t need to worry about the cost of storing the dataset.
In addition to facilitating access, we have worked to make TCGA data easier to understand. We have developed a semantic knowledge base with over 140 properties about cases, samples, and files that can be accessed through visual and programmatic queries. Users can also use our case explorer to investigate the data at per-gene or per-disease levels.
2. The best science happens in teams
Team science is becoming the norm in many research disciplines, especially in genomics and medical research. Cutting-edge projects and insights increasingly come from large consortia and multidisciplinary teams.
The CGC is built for teams, with shared project spaces, collaboration as a default, and fine-grained permission levels for managing access to different parts of the project. As an NIH Trusted Partner, Seven Bridges can authenticate and authorize approved users for access to TCGA data. Once authorized, approved researchers can collaborate with other researchers in a secure and compliant manner.
3. Reproducibility shouldn’t be hard
Reproducibility is a core principle of the scientific method, and each analysis is reproducible and “rememberable” on the CGC by default. Every task is replicable thanks to inbuilt recording of tool version and parameter settings. It’s easy to return to a particular analysis at a later date and understand precisely how output files were generated.
In addition, the CGC has been developed so that workflows are customizable and sharable. Even the most complex workflow is recorded as a simple text file using the Common Workflow Language (CWL) standard, meaning it is easily reproduced and shared.
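To give a feel for what "a simple text file" means here, the sketch below builds a minimal CWL tool description in Python and writes it out as JSON text, which CWL accepts alongside YAML. It is only an illustration under stated assumptions: the field names follow the public CWL v1.0 specification, which may differ from the CWL version the CGC uses, and the `echo` tool, the `message` input, the label, and the helper functions `make_echo_tool` and `to_text` are hypothetical and not part of the CGC.

```python
import json


def make_echo_tool(label: str) -> dict:
    """Build a minimal CWL CommandLineTool description as a plain dict.

    Field names follow the CWL v1.0 specification (an assumption; the
    platform's own CWL version may differ). The tool itself is a toy.
    """
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "label": label,
        "baseCommand": "echo",
        "inputs": {
            # Hypothetical input: a string passed as the first argument.
            "message": {
                "type": "string",
                "inputBinding": {"position": 1},
            }
        },
        "outputs": [],
    }


def to_text(tool: dict) -> str:
    """Serialize the description to stable, human-readable JSON text."""
    return json.dumps(tool, indent=2, sort_keys=True)


# The whole tool definition is just text, so it can be saved, versioned,
# shared, and loaded back without losing anything.
tool_text = to_text(make_echo_tool("echo (illustrative)"))
restored = json.loads(tool_text)
```

Because the description round-trips through plain text unchanged, a collaborator who receives the file is working from exactly the same definition, which is what makes sharing and reproducing a workflow straightforward.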
4. The impact of TCGA is extended by new data and tools
The impact of TCGA is enhanced by combination with other data and tools. Mindful of this, we have made it easy to add data to the CGC and to annotate it, so the data can be given meaningful properties for the analysis at hand.
We have also made available over 200 analysis tools and workflows on the CGC, and we welcome feedback from the community to help us prioritize additional tools. CGC users are not limited to our curated tools, however: they can also bring their own tools to the CGC, defined using CWL.
Current status
To date, more than 500 researchers from across the world have used the CGC to analyze cancer genomics data in a scalable, collaborative, reproducible, and extensible manner. You can join them by creating an account at www.cancergenomicscloud.org. More than one million dollars in compute and storage credits, including support for large collaborative projects, are available to researchers using the CGC.
Once you’ve created an account, watch one of our webinars or check the extensive documentation to get started learning from TCGA data. If you have any questions along the way or suggestions for future improvements, we encourage you to write us at cgc@sbgenomics.com or join us at one of the upcoming seminars, workshops, or meetups.
This post was originally published on the National Cancer Institute Biomedical Informatics Blog.