Festival of Genomics London — Day 2
Thursday at the Festival
Yesterday, the Seven Bridges team and other delegates were back in action at the Festival of Genomics London. On another busy day, we focused our attention on the challenges of cancer data analysis.
Nazneen Rahman from the Institute of Cancer Research described the difficulties of turning cancer variant data into a clinically useful resource. As a clinician, she wants to model genetic variant information as clinically implementable actions, where variants guide treatment decisions. For example, in breast cancer many novel BRCA variants can be classified as pathogenic or non-pathogenic through automated processes. Importantly, the effects of variants differ according to context (for example, between familial and non-familial cancers), and these contexts also need to be included in variant effect databases.
Jan Korbel from EMBL spoke about the work of the germline working group from the Pan-Cancer Analysis of Whole Genomes (PCAWG) project. PCAWG is an attempt to standardize the world’s cancer data, which has been generated and processed in many different ways worldwide. The analysis has been done using distributed computation, with subsets of the data analysed on the cloud in conjunction with industry partners (including Seven Bridges). There is now a cloud release of ~1,300 germline cancer genomes, some via The Cancer Genome Atlas (TCGA), which is accessible through our Cancer Genomics Cloud.
Nils Gehlenborg from Harvard Medical School spoke on the role of data visualization in enhancing our understanding of the cancer genome. Visualization is an important tool for pattern discovery and hypothesis generation, but is difficult to apply to increasingly large datasets. He presented StratomeX, an interface for exploring complex datasets such as TCGA. A demonstration video is available here.
Discovery in Millions of Genomes
The morning plenary session concluded with Julia Fan Li from Seven Bridges UK speaking about our experience working with very large genomic data sets.
Now on stage, @juliafanli – not only is every speaker very accomplished but very engaging too! #genomicsfest pic.twitter.com/SxrT3uHvev
— Planetary Health Futurist (@ManeeshJuneja) January 21, 2016
The number of human genomes sequenced worldwide is increasing rapidly. Many large sequencing projects are underway, including the UK 100,000 Genomes Project and the US Million Veteran Program. One million sequenced genomes is about to become a reality.
Julia Fan Li @SBGenomics says 1 million sequenced genomes coming soon. Big data infrastructure needed #genomicsfest pic.twitter.com/eFzzwzUbwW
— Andrew McConaghie (@awmcconaghie) January 21, 2016
Why do we need millions of genomes? Julia used the example of cancer, where genomics is very much driving research, diagnosis and treatment. Discovering genetic variants associated with cancer is still very much a numbers game: more sequenced tumor–normal pairs lead to the identification of more genes that are mutated at clinically important frequencies.
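To see why it is a numbers game, here is a toy binomial calculation (our illustration, not a figure from the talk): a gene mutated in 2% of tumors is easy to miss in a hundred tumor–normal pairs, but essentially impossible to miss in a thousand.

```python
from math import comb

def prob_seen_at_least(n, f, k=3):
    """Probability that a gene with true mutation frequency f is
    mutated in at least k of n tumor-normal pairs (simple binomial
    model; real analyses also correct for background mutation)."""
    return 1 - sum(comb(n, i) * f**i * (1 - f)**(n - i) for i in range(k))

for n in (100, 500, 1000):
    print(n, round(prob_seen_at_least(n, f=0.02), 3))
# 100  -> 0.323
# 500  -> 0.997
# 1000 -> 1.0
```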
But sequencing an ever-increasing number of human genomes creates challenges for data acquisition, storage, distribution and analysis. In terms of difficulty, the challenges of genomics are at least on a par with those of other major big data enterprises: astronomy, YouTube and Twitter.
Several trends emerge from these challenges:
- First, computation centers will replace data repositories;
- Second, portable workflows will replace data transfers;
- Third, advanced data structures will replace static flat files.
https://twitter.com/SBGenomics/status/690108979368820736
Computation centers replace data repositories
Genomic data poses major challenges for storage and distribution. It will rapidly become impractical to download genomic data in order to work with it locally. The alternative is to store data in a cloud environment and to bring algorithms to the data. This will become the norm.
#genomicsfest Julia Fan Li @SBGenomics: 2.5 petabytes of data req 23 days to download data. If something goes wrong, need to repeat
— Manuel Corpas (@manuelcorpas) January 21, 2016
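That figure checks out on the back of an envelope. Assuming a sustained 10 Gbit/s link (our assumption for illustration, not a number from the talk), moving 2.5 PB takes about 23 days:

```python
# Back-of-envelope check (our arithmetic): time to download 2.5 PB
# over an assumed sustained 10 Gbit/s connection.
petabytes = 2.5
link_gbps = 10  # assumed link speed

seconds = (petabytes * 1e15 * 8) / (link_gbps * 1e9)
print(f"{seconds / 86400:.1f} days")  # -> 23.1 days
```

And, as the tweet notes, any failure partway through means starting over.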
Portable workflows replace data transfers
As algorithms are brought to data, the portability of workflows becomes more important. Instead of being detailed in a paper’s methods section, the exact pipelines used for an analysis (including versions and parameter settings) will be recorded in simple text files that are easily shared and hyperlinked. This means that computational analyses will be transparent and reproducible.
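As a purely hypothetical sketch of what such a text file might contain (the tool names, versions and parameters below are illustrative, not an actual Seven Bridges pipeline; in practice, formats such as the Common Workflow Language play this role):

```python
import json

# Hypothetical example: a pipeline recorded as a plain-text (JSON)
# file, with tool versions and parameter settings made explicit so
# the exact analysis can be shared, cited, and re-run.
pipeline = {
    "name": "tumor-normal-variant-calling",
    "steps": [
        {"tool": "bwa-mem", "version": "0.7.12", "params": {"threads": 8}},
        {"tool": "samtools-sort", "version": "1.3", "params": {}},
        {"tool": "mutect", "version": "1.1.7", "params": {"tumor_lod": 6.3}},
    ],
}

with open("pipeline.json", "w") as fh:
    json.dump(pipeline, fh, indent=2)  # a shareable record of the analysis
```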
https://twitter.com/SBGenomics/status/690109972227694592
Advanced data structures replace static flat files
Current sequence and reference data is static, flat, and difficult to store and distribute. As we move towards a million sequenced genomes we must develop novel tools that better represent genetic variation and metadata. Graph genomes will capture the variation of a whole population and self-improve as more genomes become available.
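As a minimal sketch of the idea (our illustration; real graph genome implementations are far more sophisticated), a graph genome stores the reference as one path among many, with known variants as alternative branches between shared segments:

```python
# Minimal illustration: a tiny sequence graph in which the reference
# allele and a known SNP are alternative branches between shared nodes.
graph = {
    # node id -> (sequence, successor node ids)
    0: ("ACGT", [1, 2]),
    1: ("A", [3]),      # reference allele
    2: ("G", [3]),      # variant allele observed in the population
    3: ("TTCA", []),
}

def haplotypes(node=0, prefix=""):
    """Enumerate every sequence the graph can spell out."""
    seq, successors = graph[node]
    if not successors:
        yield prefix + seq
        return
    for nxt in successors:
        yield from haplotypes(nxt, prefix + seq)

print(sorted(haplotypes()))  # ['ACGTATTCA', 'ACGTGTTCA']
```

Aligning reads against a structure like this, rather than against a single linear reference, is the idea behind the Graph Aligner described below.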
https://twitter.com/SBGenomics/status/690111127850766336
This concept proved popular with the audience:
I love this idea! RT @manuelcorpas #genomicsfest Julia Fan Li @SBGenomics: a reference that learns and represents entire populations
— Becky Furlong (@becky_furlong) January 21, 2016
Graph Aligner
Finally, Julia invited the audience to apply for early access to our Graph Aligner. Our London-based team has been developing this tool with support from Genomics England as part of the 100,000 Genomes Project.
That’s it from the event in London. The Festival of Genomics will be back in Boston in June 2016.