Technical Report

De novo assembly and genotyping of variants using colored de Bruijn graphs

Nature Genetics volume 44, pages 226–232 (2012)

Abstract

Detecting genetic variants that are highly divergent from a reference sequence remains a major challenge in genome sequencing. We introduce de novo assembly algorithms using colored de Bruijn graphs for detecting and genotyping simple and complex genetic variants in an individual or population. We provide an efficient software implementation, Cortex, the first de novo assembler capable of assembling multiple eukaryotic genomes simultaneously. Four applications of Cortex are presented. First, we detect and validate both simple and complex structural variations in a high-coverage human genome. Second, we identify more than 3 Mb of sequence absent from the human reference genome, in pooled low-coverage population sequence data from the 1000 Genomes Project. Third, we show how population information from ten chimpanzees enables accurate variant calls without a reference sequence. Last, we estimate classical human leukocyte antigen (HLA) genotypes at HLA-B, the most variable gene in the human genome.
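
To make the central data structure concrete, the following is a minimal sketch of a colored de Bruijn graph, not the Cortex implementation. It assumes that each node is a k-mer annotated with the set of samples ("colors") in which it occurs, and that a variant appears as a bubble where paths of different colors diverge and rejoin. The function names, the toy sequences and the choice of k = 4 are all illustrative assumptions. Reverse complements, sequencing errors and coverage are ignored.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_colored_graph(samples, k):
    """Map each k-mer (node) to the set of colors (sample names) containing it.

    Edges are implicit: two nodes are joined when the last k-1 bases of one
    equal the first k-1 bases of the other.
    """
    graph = defaultdict(set)
    for color, sequences in samples.items():
        for seq in sequences:
            for kmer in kmers(seq, k):
                graph[kmer].add(color)
    return graph


def successors(graph, kmer, color):
    """Nodes reachable from kmer by one edge, restricted to one color."""
    return [kmer[1:] + base for base in "ACGT"
            if color in graph.get(kmer[1:] + base, ())]


def private_kmers(graph, color):
    """k-mers seen only in the given color: candidate variant sequence."""
    return {kmer for kmer, colors in graph.items() if colors == {color}}


# Toy input (hypothetical): "sample" differs from "ref" by one substitution.
samples = {
    "ref":    ["ACGTCTGATCG"],
    "sample": ["ACGTCAGATCG"],
}
graph = build_colored_graph(samples, k=4)

shared = {kmer for kmer, colors in graph.items() if len(colors) == 2}
sample_only = private_kmers(graph, "sample")
ref_only = private_kmers(graph, "ref")
```

In this toy graph the two colors share the flanking nodes and split at `CGTC` into a reference branch and a sample branch that rejoin at `GATC`. Storing colors on shared nodes is what lets several samples, or a sample and a reference, sit in one graph and be compared directly.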



Acknowledgements

We would like to thank the members of the 1000 Genomes Project Consortium for discussion, suggestions and sequencing data. We thank B. Ahiska, A. Auton, E. Birney, R. Durbin, G. Lunter, J. Woolf and D. Zerbino for discussion, two anonymous reviewers for their comments and members of the PanMap Project and the Genomics Core at the Wellcome Trust Centre for Human Genetics for access to sequence data. Z.I. is funded by a grant from the Wellcome Trust (WT086084/Z/08/Z to G.M.). The sequencing of NA12878 was performed by the Wellcome Trust Sequencing Core at Oxford, under a grant from the Wellcome Trust (090532/Z/09/Z).

Author information

Author notes

    • Zamin Iqbal & Mario Caccamo

    These authors contributed equally to this work.

Affiliations

  1. Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK.

    • Zamin Iqbal, Isaac Turner & Gil McVean

  2. European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, UK.

    • Zamin Iqbal & Paul Flicek

  3. The Genome Analysis Centre, Norwich Research Park, Norwich, UK.

    • Mario Caccamo

  4. Department of Statistics, University of Oxford, Oxford, UK.

    • Gil McVean

Authors

  1. Zamin Iqbal

  2. Mario Caccamo

  3. Isaac Turner

  4. Paul Flicek

  5. Gil McVean

Contributions

Z.I. and G.M. designed the study, developed the mathematical models and wrote the manuscript. M.C. and Z.I. developed the variant discovery algorithms, designed the multicolor graph data structures and implemented software. Z.I. performed simulations and analyses for cases 1, 3 and 4. I.T. and Z.I. performed analyses for case 2. P.F. contributed to early plans for Cortex.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Gil McVean.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Note, Supplementary Figures 1–6 and Supplementary Tables 1–7

About this article

DOI

https://doi.org/10.1038/ng.1028
