Hacking cancer in the cloud
Cancer research in the cloud
Earlier this month, budding hackers from across the US gathered at the Seven Bridges offices in Cambridge, MA. They came to spend the weekend building and applying tools for cancer research using the Cancer Genomics Cloud (CGC).
The CGC brings together data from The Cancer Genome Atlas (TCGA) and other sources, along with computational resources, curated pipelines, and visualization tools. It is a powerful resource for researchers to explore and analyze massive cancer data sets.
The hackathon took place in collaboration with the National Institutes of Health Big Data to Knowledge (BD2K) initiative. We hosted biomedical researchers, data scientists, clinicians, students and professors, who coded together through the weekend to advance our understanding of cancer.
Breaking the ice
Friday evening kicked off with a plenary talk and networking session. We welcomed George Church, Professor of Genetics at Harvard Medical School and Director of the Personal Genome Project, as our speaker.
https://twitter.com/SBGenomics/status/716039927683805184
Prof. Church spoke on the future of genomics, including the need to minimize off-target effects in gene editing, and the need for preventative medicine for cancer. He emphasized the potential for hackathons to be incubators for biological hypotheses that can be tested at the laboratory bench.
After the formalities, we broke for refreshments and networking. Coders from across the US began to mingle and form nascent teams. Seven Bridges staff who volunteered for the evening also joined in.
https://twitter.com/SBGenomics/status/716023577749925888
The cancer hackathon
On Saturday the serious work began. The attendees formed groups of 3–6 people. Some groups arrived with defined aims to tackle that weekend, while others came together around ideas that emerged during the hackathon.
Since some participants were new to TCGA, they were first shown an overview of the dataset using the data visualization features on the CGC. They then explored the Software Development Kit to create their own custom, portable tools described in the Common Workflow Language.
We’ve built the CGC to be accessible to researchers, and it was fantastic to see that people were quickly up and running on the system, and beginning to explore the data and tools on offer.
Throughout the weekend we held a number of workshops to introduce the coders to the data and tools available on the CGC. Topics included finding and using TCGA data in the CGC, developing portable software with Docker and the Common Workflow Language, and automating analysis with the CGC’s API.
By the end of Sunday, teams had wrapped up their projects, and several planned to continue developing them after the event.
Projects
Some great projects and products emerged over the 3 days. Here are some highlights:
- One team used TCGA gene expression data to understand genetic perturbations in colon cancer. They accessed TCGA files and metadata using the CGC, in this case gene expression data from RNA sequencing alongside data from DNA methylation arrays. Their analysis revealed a mutational ‘hotspot’ (i.e. a region where many cancer-causing mutations seem to occur). They plan to follow up this analysis and apply similar algorithms to other cancer types.
- Another team used gene expression data to build a learning tool. They created an interactive Jupyter notebook to help researchers do machine learning analyses of gene expression in custom cohorts of TCGA. Being able to quickly find and use TCGA data using the CGC enabled the team to build a tool to assist others in their data analysis.
- Students from the University of California, Santa Cruz worked to deploy their lab’s Python code on the cloud. Their juncBASE software identifies alternative splicing events from RNA-Seq data. It has been used in TCGA papers but was difficult to share and reuse in its current form. The students adapted the software so that its many separate scripts now run using a single command, and packaged it in a Docker container, which can be easily shared with researchers anywhere. They plan to continue to optimize their code using the tools on the CGC.
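For readers curious what "many different scripts now run using a single command" can look like in practice, here is a minimal sketch of the general idea in Python. It is not the students' actual code: the function name `run_pipeline` and the stop-at-first-failure behavior are our own assumptions, chosen only to illustrate wrapping several scripts behind one entry point.

```python
import subprocess
import sys
from typing import Sequence


def run_pipeline(steps: Sequence[Sequence[str]]) -> int:
    """Run each step in order with the current Python interpreter.

    Each step is a list of arguments, e.g. a script path followed by its
    options (hypothetical names; supply your own). Execution stops at the
    first step that exits with a non-zero status.

    Returns the number of steps that completed successfully.
    """
    completed = 0
    for args in steps:
        # sys.executable ensures every step uses the same interpreter
        # that launched this wrapper.
        result = subprocess.run([sys.executable, *args])
        if result.returncode != 0:
            # A failed step usually means later steps lack their inputs.
            break
        completed += 1
    return completed
```

A wrapper like this could serve as the single command a Docker container runs when it starts, so a user only needs to supply the input files. Stopping at the first failure is one possible design choice; a real pipeline might instead log the error and continue.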
Wrapping up
We’d like to thank everyone who participated in the hackathon for their enthusiasm and positive feedback on the CGC. Congratulations are also due to our three travel grant winners: Goran Micevic, Michael Kleyman and Cameron Soulette.
This was our first hosted hackathon, and we were keen to see what kind of relationships and projects would develop. One of our aims was to broaden access to the CGC—it was therefore gratifying to see a mix of attendees getting stuck into their projects, from undergraduate students through to clinicians and professors. It was also great to be able to watch cancer research being done in real time.
#CGChack @genomicscloud @SBGenomics The hackaton is coming to an end! Thank you everyone for being here! pic.twitter.com/1x55LHQ59v
— Federica Torri (@GenomicFed) April 3, 2016
The CGC is available for researchers to use now—sign up here to get started. We’re also proud to have seen the CGC recognized as the Best of Show at Bio-IT World 2016. We’ll next be demonstrating the CGC at the AACR annual meeting in New Orleans, April 16–20, where attendees can get hands-on experience of how the cloud is working to progress cancer research.