Developing an open standard for reproducible genomics pipelines
July was an exciting month for Seven Bridges Software Engineer Boysha Tijanic, who traveled from our Belgrade, Serbia office to give a presentation at ISMB 2014. In his talk, Boysha discussed the importance of transparency and thoroughness when it comes to sharing data. He also highlighted the complexity of bioinformatic workflows which, coupled with a continuously updating array of tools and algorithms, makes sharing and reproducing computational analyses of high-throughput data challenging.
The call for reproducibility in genomics implies that pipelines, the series of tools and algorithms used to analyze genomic data, should be both scientifically sound and reproducible. Boysha described his work developing a simple protocol and set of open source tools to capture and disseminate computational analyses. This framework includes a software description schema which, in conjunction with tool images shared as Docker containers, allows biomedical software and complete pipelines to be readily distributed, executed, and reproduced on any infrastructure. If you missed Boysha’s talk at ISMB, see his abstract, slides and a recent webinar recording below.
Developing a framework that is suitable for diverse tools and platforms is a hard problem that can only be solved with input and buy-in from the bioinformatics community at large.
Interested in contributing? Fire an email to boysha@sbgenomics.com and look forward to future updates here.
Abstract
In his painting “The Treachery of Images”, Magritte famously made the point that the painting of a pipe is not, in fact, the pipe itself. Yet, in the field of computational biology, we continue to publish our analyses as static manuscripts, although our digital networks enable us to share the artefacts themselves. Our papers should come with a similar warning, “Ceci n’est pas une analyse.” This is the treachery of papers.
Research groups consistently struggle to reproduce data analyses from other published experiments. All too frequently, bioinformatics methods are neither captured nor described in a sufficiently formal and accessible way. Software versions and the exact parameters used are not published, and datasets are hard to discover or access. Furthermore, the practical tools to enable nontechnical third parties to disseminate this information are missing. With the advent of GitHub, code sharing has increased, but that is not enough. Code is blueprints. We need the pipe itself.
To address these problems, we have developed Rabix, a set of open source tools and protocols to capture and disseminate computational analyses.
Our approach is to:
- Snapshot software binaries and all dependencies into a Docker container image.
- Include a thin adapter layer in these images which enables the translation between a universal tool description language and the application software in the container, and provides a common interface to any scheduler / executor running the pipeline.
- Allow universal tool descriptors to be assembled together via a pipeline description language for the creation of pipelines in a functional and declarative manner.
- Create an executor that allows pipelines to run efficiently and reproducibly anywhere.
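To make the four steps above more concrete, here is a minimal Python sketch of how a tool description, an adapter and a declarative pipeline could fit together. It is an illustration under stated assumptions, not Rabix's actual schema or code: the descriptor fields (`image`, `base_command`, `inputs`), the `step:<name>` reference syntax, the `.out` output naming, the image names and the tool flags are all hypothetical. The only real interface it relies on is the `docker run --rm IMAGE COMMAND` form of the Docker CLI, and it only builds the command lines without executing them.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tool descriptors. Image names, commands and flags are invented
# for illustration; they do not refer to real tools or to Rabix's schema.
TOOLS = {
    "aligner": {
        "image": "example/aligner:1.0",
        "base_command": ["aligner"],
        "inputs": {"reads": "--reads", "reference": "--ref"},
    },
    "caller": {
        "image": "example/caller:2.3",
        "base_command": ["caller", "call"],
        "inputs": {"alignment": "--bam"},
    },
}

# Declarative pipeline: each step names a tool and says where its inputs come
# from. A value of the form "step:<name>" (an assumed convention) refers to
# the output of another step.
PIPELINE = {
    "align": {"tool": "aligner",
              "inputs": {"reads": "sample.fastq", "reference": "ref.fa"}},
    "call": {"tool": "caller",
             "inputs": {"alignment": "step:align"}},
}


def build_command(tool, values):
    """Adapter: translate a descriptor plus input values into a docker argv."""
    argv = ["docker", "run", "--rm", tool["image"], *tool["base_command"]]
    for name, flag in tool["inputs"].items():
        if name not in values:
            raise KeyError(f"missing input: {name}")
        argv += [flag, str(values[name])]
    return argv


def plan(pipeline, tools):
    """Executor planning: order steps by dependency and build each command."""
    # Map every step to the set of steps whose output it consumes.
    graph = {
        step: {v[len("step:"):] for v in spec["inputs"].values()
               if v.startswith("step:")}
        for step, spec in pipeline.items()
    }
    commands = []
    for step in TopologicalSorter(graph).static_order():
        spec = pipeline[step]
        # Assume each step writes a single file named "<step>.out".
        values = {
            name: (v[len("step:"):] + ".out" if v.startswith("step:") else v)
            for name, v in spec["inputs"].items()
        }
        commands.append((step, build_command(tools[spec["tool"]], values)))
    return commands


if __name__ == "__main__":
    for step, argv in plan(PIPELINE, TOOLS):
        print(step, "->", " ".join(argv))
```

A real executor would also need to mount input and output files into each container and capture the outputs. The sketch leaves that out to keep the focus on the separation between description, adapter and scheduling.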
With Rabix, data, tools and pipelines can be published in open repositories, which will enable the community to both host and reuse them on their own infrastructure. This way, we can share the analysis itself, show instead of tell, and create reproducible building blocks to further research.
Watch the webinar