Docker-based solutions for reproducibility in science
Seven Bridges Genomics is launching a toolkit for creating fully portable bioinformatics workflows, based on Docker and the Common Workflow Language.
With high-profile cases of data fabrication recently in the news, the New York Times asks how journal referees can detect fraudulent results in submitted articles when they are rarely given access to the raw data behind them. Generally, the only way to tell whether results have been fabricated or achieved by chance is to repeat the experiment used to obtain them; for this reason, reproducibility is a crucial part of the scientific method. Only last week, an article in PLOS estimated that in the United States alone, $28 billion per year is spent on preclinical research that is not reproducible, delaying lifesaving treatments and driving up treatment costs for patients. While the PLOS article proposes ways to overcome obstacles to reproducibility in biology, in this post we’ll look at some of the emerging technical solutions to the problem in data science, including Seven Bridges’ own contribution.
The case reported in the New York Times centered on a paper by LaCour and Green. The paper claimed that, remarkably, people’s attitudes towards same-sex marriage could be significantly altered following a scripted interview with a canvasser. But the alleged results contained some statistical anomalies, which were discovered and reported by two PhD students, David Broockman and Joshua Kalla, together with Peter Aronow, an assistant professor.
What makes Broockman, Kalla and Aronow’s paper so convincing is that it was produced using knitr, a report-generating tool in which the data and the text are knitted together into a single distributable package. Knitr lets authors embed chunks of code written in the programming language R, along with links to the raw data, into the text of their articles. Then, when the article is compiled to PDF or HTML, the charts are automatically built and inserted, so anyone with the article’s source can produce exactly the same charts. Reports produced with knitr are prime examples of reproducible research: a new paradigm in data science publications, in which the raw data and code required to produce the results of an article are packaged and published together.
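For example, a knitr code chunk embedded in an article’s source might look something like the sketch below (the file and column names here are hypothetical, invented for illustration; the point is that the chart is regenerated from the raw data every time the article is compiled):

```{r attitude-shift, echo=TRUE}
# Read the raw data distributed alongside the article (hypothetical file name)
survey <- read.csv("survey_responses.csv")

# The chart is rebuilt from the data on every compile, so readers can
# reproduce it exactly rather than trusting a pasted-in image
boxplot(shift ~ group, data = survey,
        xlab = "Treatment group", ylab = "Reported attitude shift")
```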
Organizations such as the Science Exchange are springing up to address best practices for reproducible research, and there is a growing movement calling on journals to impose standards for it. In response, Nature asked its authors to state the availability and location of any custom code that is central to the claims made in their publication, but stopped short of insisting that code be made available, citing “diversity of practices”. As the reproducible research movement grows, we can expect to see journals putting increasing pressure on researchers to provide the code required to reproduce their results.
Even when code is made available, however, scientists face the problem of deciding what to infer from a failed replication. To use a historical example, when Mendel repeated his cross-pollination experiments with a new generation of pea plants, he occasionally failed to find the 1:3 ratio of recessive to dominant phenotypes; his results were not always reproducible. However, rather than scrapping his laws of inheritance, Mendel inferred that in those cases the plants must have been pollinated by stray pollen blown in on the breeze, rather than self-pollinating as he had expected. Scientific laws tend to be like this; they describe what we can expect to happen when conditions are not too wildly different from those under which the laws were first observed. So, to ensure that failures of replication are scientifically meaningful, and not the result of outside influences, scientists carefully control their test environments. This is no less true for science based on computation: when a journal referee fails to replicate the result of a data analysis workflow claimed in a submitted article, she needs to be able to determine whether the failure is due to a difference in runtime environments or whether it indicates an illegitimate conclusion.
Problematically, in addition to the obvious environmental influences on computation, such as the number of CPUs available or the operating system, subtle and seemingly unrelated processes on a user’s computer can determine the behaviour of scientific software. A Nature article from 2012 gives an example in which the outcome of a computation is shown to depend on whether or not the user has printed debugging statements. The authors conclude with a proposal: “To maximize the chances of reproducibility and consistency, not only would we urge code release, but also a description of the hardware and software environment in which the program was executed and developed.”
The request that researchers provide a description of their runtime environment along with the code itself is sensible enough, but in practice it is complicated and time-consuming to make sure that a given description is faithfully matched on another machine. The remote colleague has to check that she has installed the correct version of Ubuntu, allocated the right memory resources, and so on. And even then, she has no guarantee that more subtle differences between the two machines won’t influence the software’s behaviour.
Fortunately, since the 2012 Nature article was published, a better solution has emerged: Docker, which lets users package up and distribute entire computational environments. Bioinformaticians have been quick to recognize Docker’s potential for reproducible data analysis workflows: in recent blog posts, Brad Chapman of the Harvard School of Public Health and C. Titus Brown of Michigan State University have both spoken positively of the technology (Heng Li of the Broad Institute is a little more nuanced), and Nature Biotechnology now recommends Docker as a method for tool-sharing.
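To make the idea concrete, here is a minimal, hypothetical Dockerfile of the kind a researcher might publish alongside an analysis (the base image, packages and script name are invented for illustration, not taken from any actual Seven Bridges pipeline):

```dockerfile
# Pin the operating system the analysis was developed on
FROM ubuntu:14.04

# Install the command-line tools and the R runtime the analysis depends on
# (in practice, exact package versions would be pinned here too)
RUN apt-get update && \
    apt-get install -y samtools r-base-core && \
    apt-get clean

# Bake the analysis script itself into the image
COPY analyze.R /opt/analyze.R

# Every container started from this image runs the same code
# in the same environment, regardless of the host machine
ENTRYPOINT ["Rscript", "/opt/analyze.R"]
```

A colleague who builds and runs this image with Docker gets the same environment whether she is working on a laptop, a departmental server or the cloud.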
At Seven Bridges Genomics, we’ve been taking advantage of Docker containers to deploy bioinformatics tools to our cloud-based computation platform. This means we can ensure that the behaviour of any custom-written software on a scientist’s local Linux machine is reproduced precisely on the cloud. And, when used in conjunction with the Common Workflow Language (CWL), containerization allows us to produce fully portable data analysis workflows. Our new open-source project, Rabix.org, provides a simple way to do this. The site features a tool editor, used to specify the interface of each individual data analysis tool (each tool is installed into its own container), and a pipeline editor, used to describe how the tools are chained together to make a workflow. This information is then used to produce a CWL specification of the workflow. Anyone with the specification can replicate the workflow, down to the precise parameterization of each tool, in its original computational environment. This makes workflows genuinely portable and represents real progress for the reproducible research movement.
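To give a flavour of what such a specification looks like, here is a minimal, hypothetical CWL description of a single containerized tool; the Docker image name and the choice of samtools sort are purely illustrative and are not taken from an actual Rabix pipeline:

```yaml
# Hypothetical CWL description of one containerized step in a workflow
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [samtools, sort]
requirements:
  - class: DockerRequirement
    dockerPull: my-registry/samtools:1.9   # hypothetical image name
inputs:
  input_bam:
    type: File
    inputBinding:
      position: 1
outputs:
  sorted_bam:
    type: stdout
```

A pipeline editor composes tool descriptions like this one into a Workflow document that wires one tool’s outputs into the next tool’s inputs, and any CWL-aware executor can then run the whole workflow wherever Docker is available.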
All of this technology has also been built into Seven Bridges’ software development kit, which we are excited to be launching very soon. Expect to hear more from our engineering team on this in the near future.