WASP: allele-specific software for robust molecular quantitative trait locus discovery

Journal name: Nature Methods
Volume: 12
Pages: 1061–1063
DOI: 10.1038/nmeth.3582

Allele-specific sequencing reads provide a powerful signal for identifying molecular quantitative trait loci (QTLs), but they are challenging to analyze and are prone to technical artifacts. Here we describe WASP, a suite of tools for unbiased allele-specific read mapping and discovery of molecular QTLs. Using simulated reads, RNA-seq reads and chromatin immunoprecipitation sequencing (ChIP-seq) reads, we demonstrate that WASP has a low error rate and is far more powerful than existing QTL-mapping approaches.

Figures

  1. Figure 1: Mapping of allele-specific reads.

    (a) Mapping to personalized genomes can result in allelic bias because reads from one allele might not map uniquely. (b) Read-mapping pipeline to remove allelic bias. (c) The percentage of simulated 100-bp reads at heterozygous sites where a read with one allele mapped correctly and the corresponding read with the other allele did not. Reads were simulated with sequencing errors introduced at several different rates. (d) The fraction of false positives as a function of the effect size determined using a nominal Benjamini-Hochberg FDR of 10% (yellow dashed line). We simulated 100-bp allele-specific reads under null (odds ratio = 1) and alternative models (odds ratio > 1) of allelic imbalance at heterozygous sites in the genome. We assumed that 90% and 10% of sites were null and alternative sites, respectively. We mapped reads using WASP, personal-genome (AlleleSeq; ref. 10) or N-masked–genome mapping strategies and called allele-specific sites using a binomial test.

  2. Figure 2: The combined haplotype test and its performance.

    (a) A test SNP is tested for association with mapped reads within a target region. All reads are used by the read-depth component of the test; allele-specific reads are used by the allelic-imbalance component of the test. (b) Identification of novel QTLs using H3K27ac ChIP-seq data from ten Yoruba LCLs. (c) Identification of European eQTLs from the GEUVADIS consortium using an independent RNA-seq data set from 69 Yoruba LCLs. Red dashed lines in b and c represent the null values.

  3. Supplementary Fig. 1: WASP mapping errors at heterozygous sites.

    WASP mapping errors at heterozygous sites as a function of the rate of unknown single-nucleotide variants (SNVs).

  4. Supplementary Fig. 2: Quantile-quantile plots of ranked –log10 P values from the combined haplotype test.

    (a) Ranked –log10 P values from running the combined haplotype test on H3K27ac ChIP-seq data from ten lymphoblastoid cell lines compared to P values expected under the null hypothesis. The permuted points are for the same data set, but with the genotypes of each SNP shuffled. (b) Ranked –log10 P values from running the combined haplotype test on RNA-seq data from 69 YRI cell lines. The test was run only on eQTLs that were previously identified in cell lines derived from European individuals (ref. 1, listed below). The permuted points are for the same data set, but with the genotypes of each SNP shuffled.

    1. Lappalainen, T. et al. Nature 501, 506–511 (2013).

  5. Supplementary Fig. 3: Receiver operating characteristic (ROC) curves showing the performance of five methods for QTL identification on simulated data.

    Performance for different numbers of individuals and effect sizes. The simulations are described in Supplementary Note 6.

  6. Supplementary Fig. 4: The WASP mapping pipeline.

    Reads are first mapped to the genome using a mapping tool of the user’s choice. The aligned reads are provided to WASP in SAM (sequence alignment/map) or BAM (binary alignment/map) format, along with a list of known polymorphisms. WASP identifies reads that overlap known polymorphisms, flips the alleles in the reads, and remaps them to the genome. Reads that map to a different location than the original read are then discarded. Finally, WASP can optionally remove reads that map to the same genomic location (‘duplicate reads’) without introducing a reference bias.

  7. Supplementary Fig. 5: The WASP combined haplotype test pipeline.

    Mapped reads (in BAM or SAM format) for each individual, genotypes for known SNPs, and a list of regions and SNPs to test are provided to WASP. WASP extracts read counts for the target regions as well as allele-specific read counts. Read counts from multiple sources can be used to update heterozygous probabilities. Expected read counts for each region are adjusted through modeling of the relationships between read counts and GC content and between read counts and total read counts for each sample. Dispersion parameters are estimated from the data and provided to the combined haplotype test along with the read counts. Principal components can optionally be used as covariates by the test.
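
As an illustration of the filtering step summarized in the caption of Supplementary Fig. 4, the following Python sketch flips alleles in reads that overlap known SNPs, remaps the flipped sequences and discards any read whose flipped version does not return to its original location. It is not WASP's code. The `Read` type, the `flipped_versions`, `filter_reads` and `make_toy_mapper` helpers and the toy genome are hypothetical names introduced here. Reads are assumed to be ungapped, and testing every allelic combination at overlapping SNPs is an assumption rather than a detail stated in the caption. The toy mapper stands in for the mapping tool of the user's choice.

```python
import itertools
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple

Location = Tuple[str, int]               # (chromosome, 0-based start)
Snps = Dict[Location, Tuple[str, str]]   # SNP position -> two known alleles


class Read(NamedTuple):
    name: str
    chrom: str
    pos: int   # 0-based leftmost aligned position
    seq: str   # assumed ungapped: no indels or clipping


def flipped_versions(read: Read, snps: Snps) -> List[str]:
    """Return every allelic combination of the read except the original."""
    sites = [
        (offset, snps[(read.chrom, read.pos + offset)])
        for offset, base in enumerate(read.seq)
        if base in snps.get((read.chrom, read.pos + offset), ())
    ]
    variants = []
    for combo in itertools.product(*(alleles for _, alleles in sites)):
        bases = list(read.seq)
        for (offset, _), allele in zip(sites, combo):
            bases[offset] = allele
        seq = "".join(bases)
        if seq != read.seq:
            variants.append(seq)
    return variants


def filter_reads(reads: List[Read], snps: Snps,
                 remap: Callable[[str], Optional[Location]]) -> List[Read]:
    """Keep reads whose flipped versions all remap to the original location."""
    kept = []
    for read in reads:
        original = (read.chrom, read.pos)
        if all(remap(seq) == original for seq in flipped_versions(read, snps)):
            kept.append(read)
    return kept


def make_toy_mapper(genome: Dict[str, str], max_mismatches: int = 1):
    """Hypothetical stand-in for an aligner: unique best hit or None."""
    def remap(seq: str) -> Optional[Location]:
        best, hits = None, []
        for chrom, ref in genome.items():
            for start in range(len(ref) - len(seq) + 1):
                window = ref[start:start + len(seq)]
                mm = sum(a != b for a, b in zip(seq, window))
                if mm > max_mismatches:
                    continue
                if best is None or mm < best:
                    best, hits = mm, [(chrom, start)]
                elif mm == best:
                    hits.append((chrom, start))
        return hits[0] if len(hits) == 1 else None
    return remap


# Toy data: positions 0-7 and 16-23 differ only at the SNP site.
genome = {"chr1": "ACGTACGT" + "G" * 8 + "ACGAACGT" + "G" * 8 + "TTGCATCA"}
snps = {("chr1", 3): ("T", "A"), ("chr1", 35): ("C", "G")}
reads = [
    Read("readA", "chr1", 0, "ACGTACGT"),   # flipped read matches position 16 better
    Read("readB", "chr1", 32, "TTGCATCA"),  # flipped read still maps to position 32
    Read("readC", "chr1", 8, "GGGGGGGG"),   # overlaps no known SNP
]
remap = make_toy_mapper(genome)
kept = filter_reads(reads, snps, remap)
print([r.name for r in kept])  # ['readB', 'readC']
```

In this toy case `readA` is discarded because its flipped sequence aligns best to a near-identical region elsewhere, which is the kind of read that would otherwise contribute to allelic bias. Reads that overlap no known SNP pass through without remapping.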

References

  1. Degner, J.F. et al. Nature 482, 390–394 (2012).
  2. Montgomery, S.B. et al. Nature 464, 773–777 (2010).
  3. Pickrell, J.K. et al. Nature 464, 768 (2010).
  4. Skelly, D.A., Johansson, M., Madeoy, J., Wakefield, J. & Akey, J.M. Genome Res. 21, 1728–1737 (2011).
  5. Harvey, C.T. et al. Bioinformatics 31, 1235–1242 (2015).
  6. Sun, W. Biometrics 68, 1–11 (2012).
  7. Degner, J.F. et al. Bioinformatics 25, 3207–3212 (2009).
  8. Panousis, N.I., Gutierrez-Arcelus, M., Dermitzakis, E.T. & Lappalainen, T. Genome Biol. 15, 467 (2014).
  9. Anders, S. & Huber, W. Genome Biol. 11, R106 (2010).
  10. Rozowsky, J. et al. Mol. Syst. Biol. 7, 522 (2011).
  11. Liu, Z. et al. Genet. Epidemiol. 38, 591–598 (2014).
  12. Roberts, A. & Pachter, L. Nat. Methods 10, 71–73 (2013).
  13. Turro, E. et al. Genome Biol. 12, R13 (2011).
  14. Li, H. et al. Bioinformatics 25, 2078–2079 (2009).
  15. Benjamini, Y. & Speed, T.P. Nucleic Acids Res. 40, e72 (2012).
  16. McVicker, G. et al. Science 342, 747–749 (2013).
  17. Lappalainen, T. et al. Nature 501, 506–511 (2013).
  18. Katz, Y., Wang, E.T., Airoldi, E.M. & Burge, C.B. Nat. Methods 7, 1009–1015 (2010).
  19. Trapnell, C. et al. Nat. Biotechnol. 31, 46–53 (2013).

Author information

  1. These authors contributed equally to this work.

    • Bryce van de Geijn &
    • Graham McVicker

Affiliations

  1. Department of Human Genetics, University of Chicago, Chicago, Illinois, USA.

    • Bryce van de Geijn &
    • Yoav Gilad
  2. Committee on Genetics, Genomics and Systems Biology, University of Chicago, Chicago, Illinois, USA.

    • Bryce van de Geijn
  3. Department of Genetics, Stanford University, Stanford, California, USA.

    • Graham McVicker &
    • Jonathan K Pritchard
  4. Department of Biology, Stanford University, Stanford, California, USA.

    • Jonathan K Pritchard
  5. Howard Hughes Medical Institute, Stanford University, Stanford, California, USA.

    • Jonathan K Pritchard

Contributions

B.v.d.G., G.M., J.K.P. and Y.G. conceived of the project. B.v.d.G. and G.M. performed the analyses and implemented the software. G.M. and B.v.d.G. wrote the manuscript with input from all authors. J.K.P. and Y.G. directed the project.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Figures

  1. Supplementary Figure 1: WASP mapping errors at heterozygous sites. (95 KB)

  2. Supplementary Figure 2: Quantile-quantile plots of ranked –log10 P values from the combined haplotype test. (99 KB)

  3. Supplementary Figure 3: Receiver operating characteristic (ROC) curves showing the performance of five methods for QTL identification on simulated data. (263 KB)

  4. Supplementary Figure 4: The WASP mapping pipeline. (119 KB)

  5. Supplementary Figure 5: The WASP combined haplotype test pipeline. (122 KB)

PDF files

  1. Supplementary Text and Figures (3,472 KB)

    Supplementary Figures 1–5, Supplementary Table 1 and Supplementary Notes 1–8

Zip files

  1. Supplementary Software (705 KB)

    WASP code and documentation. Updated files are maintained at https://github.com/bmvdgeijn/WASP
