Reference bias: Challenges and solutions
Ahead of the Bio-IT World Conference & Expo in Boston this week, we take a look at the role of reference genomes in ensuring accurate genomic analysis.
Reference bias leads to inaccurate genomic analysis
In standard next-generation sequencing analyses, DNA is fragmented and sequenced. The sequenced reads are then aligned to a reference genome for the species. Comparing the reads to the reference genome identifies sites of variation, which often have functional consequences.
Conventionally, read alignment is performed using a linear reference genome. Because a linear reference genome comprises a single sequence, either the genome of one individual or a consensus from a group of individuals, it does not capture the genomic diversity of a population. As a result, reference bias occurs: sample reads that highly differ from the reference do not map correctly, and are either mapped to the wrong position in the genome or remain unmapped altogether. Incorrect read mapping in turn leads to false negative or false positive variant calls.
Reference bias. When aligned to a linear reference genome, only a small proportion of sample reads containing an insertion (dark blue) are mapped to the correct position.
Reference bias has been characterized in a number of studies. Lunter and Goodson tested various mapping tools (BWA, MAQ, and Stampy) on reads from an individual from the 1000 Genomes Project, focusing on a known heterozygous indel in the genome. The mappers showed a consistent bias in mapping reads containing the reference allele, leading to underestimation of the proportion of reads with a non-reference allele. Degner and colleagues sequenced mRNA from two HapMap cell lines. When reads containing heterozygous SNPs were aligned to the reference genome using MAQ, reads with the reference allele were mapped significantly more than reads with the non-reference allele.
Reference bias limits clinical genomic research
Reference bias affects clinically significant regions of the genome. For example, because human leukocyte antigen (HLA) genes have the highest diversity of any region in the genome, HLA genotyping is susceptible to reference bias. When Brandt and colleagues compared HLA genotypes from the 1000 Genomes Project as measured by next-generation sequencing methods or gold-standard Sanger sequencing, they found that 18.6% of SNPs identified via next-generation sequencing variant calling were inaccurate. They also found evidence of reference bias—reads with an alternative allele at a SNP were less likely to map to the reference genome than were reads carrying the reference allele. Because reference bias results in inaccurate HLA typing by next-generation sequencing, genomic research is limited in the many clinical areas where HLA genes play an important role, from autoimmune diseases to organ transplant rejection.
Another area of clinical research impacted by reference bias is structural variant discovery in cancer genomes. Structural variants are relatively large genome alterations (tens to hundreds of bases or longer) and include deletions, insertions, tandem duplications, inversions, and translocations. Algorithms to identify structural variants largely rely on detecting patterns of discordant read pairs or split reads, an approach that depends on the accuracy of read mapping. Because of this, reference bias limits the detection of novel structural variants via next-generation sequencing, impacting the characterization of structural variants as drivers of tumor development.
Better references can improve genomic analysis accuracy
To address the problem of reference bias, one approach is to improve the reference genome so that it captures the genomic diversity of the entire population. Mapping and variant calling accuracy improves if reads are aligned to a representative collection of genomes, rather than a single linear genome.
The most recent releases of the human genome, GRCh37 and GRCh38, make a stride in this direction by including alternative loci in genomic regions with high diversity. However, making use of these data has proved challenging: most mapping tools still use one primary sequence as the reference, with limited capabilities to take into account alternative loci.
Alternatively, a diverse collection of genomes can be represented as an acyclic or cyclic sequence graph. Individual genomes in the population can then be identified as paths in the graph. Dilthey and colleagues applied this approach to the HLA region, constructing a directed acyclic graph using multiple data sources: primary and alternative sequences from GRCh37, SNPs identified by the 1000 Genomes Project, and sequences from the International Immunogenetics Information System. SNP inference using the graph reference had high accuracy, with a high concordance of SNPs identified by mapping sequencing data to the graph reference and via a SNP genotyping array. This study demonstrated how the diversity of a population can be captured in a single data structure, the graph reference genome.
Toward a better reference. Compared with alignment to a linear reference genome, alignment to a graph reference genome that captures population variation results in more sample reads that contain an insertion (dark blue) being mapped to the correct position.
Join the discussion at Bio-IT World
To learn more about how graph genome tools can help overcome the problems caused by reference bias, visit Seven Bridges at Bio-IT World this week, at booth 432 in the exhibit hall. Also check out our various presentations, including Director of Program Management Jack DiGiovanna’s talk about the applications of graph genome tools in precision medicine.