Enabling Workflow Reproducibility in the Cloud with New Pipelines from the Genomic Data Commons
When analyzing genomic data, there is a vast range of bioinformatics tools and workflows to choose from. However, making an informed selection from so many options can be overwhelming, even within a relatively narrow topic, such as harmonization to a reference genome. One approach to selecting the right tool for your analysis is to use well-adopted methods developed and used by large institutions. The Genomic Data Commons (GDC) is the main repository for large public datasets on cancer, such as The Cancer Genome Atlas. It contains high quality, validated, curated, and annotated data from many cancer studies. The GDC also makes its workflows available, and these may be excellent candidates for best-practice workflows for your own work. These tools allow you to reproduce existing analyses and analyze your own data in the exact same manner as the large public datasets, which enables comparison studies and can dramatically increase the power of smaller datasets.
However, after you’ve identified a workflow, running it yourself on your own data may not be trivial. Many tools do not perform the same way in different environments, which can make analyzing your own data alongside public or previously published data challenging. In addition, to compare your data to published data, you would also have to figure out how to download the published data and find the processing power to drive the analysis and comparisons in a reasonable timeframe.
The Cancer Genomics Cloud (CGC), powered by Seven Bridges, makes this type of analysis easy and attainable by providing access to the data, the computation to run the analyses, and the necessary pipelines, all in one platform. Using the CGC, you can run workflows from a wide range of sources, including the GDC. These 400+ workflows are searchable and grouped by category so you can find them easily, without having to develop them yourself. Because the workflows are described in Common Workflow Language (CWL), they are easily portable to the cloud, and the built-in controls and versioning ensure that a workflow will run the same way every time.
Genomic Data Commons Workflows for Common Analyses
Three GDC Workflows that represent convenient options for common analyses are:
- DNASeq Harmonization Workflow
- RNASeq Workflow
- Tumor Only Variant Calling Workflow
All three of these tools can be run from the visual interface of the CGC or through the Python, R, or Java APIs. To help you plan your research, we’ve provided time and cost estimates for the workflows, though the actual values will depend on the size and complexity of the specific samples.
DNASeq Harmonization Workflow
The GDC DNASeq Harmonization workflow is used for harmonization of genomic data to the GRCh38 reference genome. This can be helpful when files within a data cohort are not aligned to the same reference, and it can be used to harmonize all data to the most up-to-date version of the genome. The workflow first converts a BAM input file to FASTQ, aligns the reads to the GRCh38 reference genome using BWA, marks duplicates with Picard MarkDuplicates, and then recalibrates base quality scores in the resulting BAM file using GATK BQSR. The workflow was modified by the Seven Bridges team for optimal performance in the cloud and for ease of use, without impacting its output. More info on running the DNASeq Harmonization workflow can be found here.
RNASeq Workflow
The GDC RNASeq workflow is used for alignment and quantification of RNA-Seq data to look at expression changes between samples. It takes either unmapped BAM or FASTQ files as inputs, uses STAR to align sequencing reads to the reference genome, and uses HTSeq to count sequencing reads based on the annotation file. Raw read counts are then normalized using two similar methods: FPKM and FPKM-UQ. More information on the RNASeq workflow can be found in the GDC documentation.
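To make the difference between the two normalizations concrete, here is a minimal Python sketch. It assumes the formulas as commonly stated in the GDC documentation: FPKM divides a gene's read count by the total number of reads mapped to protein-coding genes, while FPKM-UQ divides by the 75th-percentile read count among protein-coding genes instead. The function names, the toy counts, and the choice of quantile method are illustrative assumptions, not part of the GDC workflow itself, and the exact definitions may differ between GDC data releases.

```python
from statistics import quantiles


def fpkm(read_count, gene_length_bp, total_protein_coding_reads):
    """FPKM: reads per kilobase of gene per million protein-coding reads."""
    return read_count * 1e9 / (total_protein_coding_reads * gene_length_bp)


def fpkm_uq(read_count, gene_length_bp, upper_quartile_count):
    """FPKM-UQ: same shape, but scaled by the 75th-percentile read count."""
    return read_count * 1e9 / (upper_quartile_count * gene_length_bp)


def upper_quartile(counts):
    """75th percentile of per-gene counts (quantile method is an assumption)."""
    return quantiles(counts, n=4, method="inclusive")[2]


# Hypothetical toy sample: gene name -> (raw read count, gene length in bp)
sample = {
    "GENE_A": (100, 1000),
    "GENE_B": (400, 2000),
    "GENE_C": (50, 500),
    "GENE_D": (250, 1500),
}

counts = [count for count, _ in sample.values()]
total_reads = sum(counts)
uq_count = upper_quartile(counts)

for gene, (count, length) in sample.items():
    print(gene, fpkm(count, length, total_reads), fpkm_uq(count, length, uq_count))
```

Because the upper-quartile denominator is less sensitive to a handful of very highly expressed genes than the total read count, FPKM-UQ values tend to be more stable across samples with different expression profiles.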
Tumor Only Variant Calling Workflow
The GDC Tumor Only Variant Calling workflow is used for variant calling on a tumor sample that does not have a matched normal sample. This method takes advantage of the normal cell contamination that is present in most tumor samples in order to differentiate between somatic and germline variation. These calls are made using the version of MuTect2 included in GATK4. Tumor-only variant call files can be found in the GDC Portal by filtering for “Workflow Type: GATK4 MuTect2”. The GDC Tumor Only Variant Calling workflow produces output that is harmonized with the VCF files in the TCGA dataset.
One feature of the Tumor Only Variant Calling workflow is that it requires a Panel of Normals (PON) file for reference. You can add your own PON file that contains the appropriate references for your samples. Alternatively, if you have the necessary access privileges, you can obtain the GDC PON file directly from the GDC reference files. The GDC command line parameters documentation has more information on choosing and obtaining the recommended workflow inputs.
These three workflows from the GDC demonstrate how you can use complex workflows without having to set them up in your local environment. These and many other high-quality, best-practice tools and workflows can be accessed through the public apps gallery on the CGC. We are constantly adding new workflows, and we’d love to hear which new tools could help your research, whether additional workflows from the GDC or any other best-practice workflows.