Dosage sensitivity and exon shuffling shape the landscape of polymorphic duplicates in Drosophila and humans

Abstract

Despite polymorphic duplicate genes' importance for the early stages of duplicate gene evolution, they are less studied than old gene duplicates. Two essential questions thus remain poorly addressed: how does dosage sensitivity, imposed by stoichiometry in protein complexes or by X chromosome dosage compensation, affect the emergence of complete duplicate genes? Do introns facilitate intergenic and intragenic chimaerism as predicted by the theory of exon shuffling? Here, we analysed new data for Drosophila and public data for humans, to characterize polymorphic duplicate genes with respect to dosage, exon–intron structures and allele frequencies. We found that complete duplicate genes are under dosage constraint induced by protein stoichiometry but potentially tolerated by X chromosome dosage compensation. We also found that in the intron-rich human genome, gene fusions and intragenic duplications extensively use intronic breakpoints generating in-frame proteins, in accordance with the theory of exon shuffling. Finally, we found that only a small proportion of complete or partial duplicates are at high frequencies, indicating the deleterious nature of dosage or gene structural changes. Altogether, we demonstrate how mechanistic factors including dosage sensitivity and exon–intron structure shape the short-term functional consequences of gene duplication.

Fig. 1: Project design and evaluation of breakpoint inference.
Fig. 2: Type and distribution of duplicate genes.
Fig. 3: Expression changes and dosage constraints of duplicate genes.
Fig. 4: Pervasive chimaerism due to incomplete duplications.
Fig. 5: Internal duplications often lead to alternative splicing.
Fig. 6: Distribution of duplicate genes with respect to allele frequencies.

Data availability

We have released both the raw and the processed data. First, the resequencing data, RNA-seq data and Iso-seq data generated in this study were submitted both to the NCBI BioProject database under accession numbers PRJNA681089, PRJNA681417 and PRJNA693662 and to the Genome Sequence Archive of the National Genomics Data Center (NGDC, part of the China National Center for Bioinformation) under accession numbers PRJCA004186, PRJCA001789 and PRJCA004319. Second, the processed fly data (duplication loci, DNA read alignment depth, RNA-seq alignments, assembled chimaeras, PacBio CCS reads and RT–PCR Sanger sequencing data) and human data (duplication loci and assembled chimaeras) were uploaded to the UCSC Genome Browser as public sessions (http://genome.ucsc.edu/cgi-bin/hgPublicSessions) named ‘Drosophila Duplication’ and ‘Human Duplication’, respectively. We annotated these tracks to make them as useful as possible: (1) each duplication ID corresponds to an entry in Supplementary Table 2; (2) clicking on a duplication ID shows the lines or individuals harbouring that duplication; and (3) assembled chimaeras were split at breakpoints and the parts were aligned separately to the reference genome, as in Fig. 4c. Finally, the assembled contigs and the RT–PCR Sanger sequencing files were deposited at https://github.com/Zhanglab-IOZ/Polymorphic-Duplication and https://sandbox.zenodo.org/record/946570.

Code availability

The code for split-read-based duplication calling was submitted to: https://github.com/Zhanglab-IOZ/Polymorphic-Duplication. It is also archived at https://sandbox.zenodo.org/record/946570.

Acknowledgements

We thank the DGRP team and the GTEx team for generating and releasing the data. We thank M. Cardoso-Moreira for improving the writing. We thank M. Long, B. He, X. Lan, W. Qian, B. Guo and Zhang Laboratory (Institute of Zoology (IOZ), Chinese Academy of Sciences) members for the helpful discussions about the project. We thank Y. Fu, Y. Ding, G. Gao and Z. Wang for the computational help. We thank J. Wang and X. Huang for experimental help. This research was supported by the National Key R&D Program of China (2018YFC1406902 and 2019YFA0802600), the Chinese Academy of Sciences (ZDBS-LY-SM005, XBZG-ZDSYS-201913 and XDPB17), the National Natural Science Foundation of China (31771410 and 31970565) and the Open Research Program of the Chinese Institute for Brain Research; all were granted to Y.E.Z. The computing was jointly supported by the HPC Platform of BIG and that of the Scientific Information Centre of IOZ.

Author information

Contributions

Y.E.Z. conceived and designed the study. D.Z. performed the experimental analyses with the help of J.W.H. and Y.Q.Z. D.Z. and L.L. performed the computational analyses with the help of C.Y.C., H.Y., C.Y.M. and H.C. Y.E.Z., D.Z. and L.L. wrote the paper.

Corresponding author

Correspondence to Yong E. Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review information

Nature Ecology & Evolution thanks Ben-Yang Liao and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Phylogeny and features of 31 DGRP core lines.

We ordered 40 DGRP core lines from the Bloomington Stock Center but could maintain only 31 of these lines up to the time of sequencing. Lines infected with Wolbachia, lines harbouring inversions and lines with a fraction of segregating variants in at least one chromosome arm higher than 2% are marked in green, pink and blue, respectively. Among the remaining lines, six less closely related lines (marked in bold black) were subsequently used.

Extended Data Fig. 2 Heatmap of 108 tissue-specific genes.

The pheatmap package was used to generate this figure, in which expression was rescaled for each gene to highlight the tissue-specificity.

Extended Data Fig. 3 Size distribution of different types of duplicates and protein-coding genes.

a) All (top) duplications and the duplications spanning a single gene (bottom). b) Length distribution of all duplications relative to the length of all protein-coding genes. The median values are shown with each group. P values were calculated with Wilcoxon signed rank tests.

Extended Data Fig. 4 Proportion distribution in the shuffling dataset.

a) Proportion of complete/partial/intronic duplicates. b) Proportion of six arrangements of partial duplicates. c) Proportion of intronic breakpoints. We performed 100 simulations. For each simulation, we calculated the proportion of complete, partial and intronic duplicates. For partial duplicates, we further calculated the proportion of six subtypes. For each subtype, we also calculated the proportion of intronic breakpoints. We summarized the 100 values as a boxplot. The median simulated values and the observed values are shown alongside each group. For each plot, we marked the observed value with a black line and report the empirical P value as the fraction of the 100 replicates in which the simulated value is more extreme than the observed value. For example, in the case of partial duplicates, the observed proportion is 58.9%, which is lower than the simulated proportion in only 12 out of 100 random replicates (top left panel). The corresponding P value is therefore 0.12.
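The empirical P value described in this legend can be illustrated with a minimal Python sketch. The function name `empirical_p_value` and the replicate proportions are hypothetical placeholders, chosen only to reproduce the 12-out-of-100 example above; they are not the study's simulation output.

```python
def empirical_p_value(observed, simulated, tail="greater"):
    """Fraction of simulated replicates strictly more extreme than the observed value.

    tail="greater": count replicates whose value is higher than the observed value
    tail="less":    count replicates whose value is lower than the observed value
    """
    if not simulated:
        raise ValueError("need at least one simulated replicate")
    if tail == "greater":
        extreme = sum(1 for s in simulated if s > observed)
    elif tail == "less":
        extreme = sum(1 for s in simulated if s < observed)
    else:
        raise ValueError("tail must be 'greater' or 'less'")
    return extreme / len(simulated)


# Hypothetical replicates: 12 proportions above and 88 below the observed 58.9%.
observed = 0.589
simulated = [0.60] * 12 + [0.55] * 88
p = empirical_p_value(observed, simulated, tail="greater")
```

With these placeholder values, `p` is the share of replicates in which the simulated proportion exceeds the observed one, matching the worked example in the legend.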

Extended Data Fig. 5 Tissue-level transcriptional fold-change distribution for complete duplicates.

The conventions of this figure follow Fig. 3a. For each gene in each tissue, we calculated the fold-change from the median expression in duplication-present and duplication-absent lines or individuals. In flies, each tissue covers at least 19 transcribed (TPM > 1) genes. In humans, only the 11 tissues or cell lines covering at least 15 transcribed genes are shown. A Wilcoxon signed rank test was used to estimate the statistical significance relative to the expected 100% upregulation (top panel) and 50% upregulation (bottom panel). Despite some moderate fluctuation, the extent of upregulation generally does not deviate from the expectation (for 6 out of 7 tissues in flies and 11 out of 11 in humans). FFB, MFB and AG refer to female fat bodies, male fat bodies and accessory glands, respectively. Note that for AG, the deviation from 100% is significant simply because most of the data are smaller than 100% (purple curve). By contrast, FFB is not significant despite its smaller median (48% versus 54%).
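The fold-change calculation in this legend can be sketched as follows. This is a minimal illustration that assumes the percentage is the relative change of the carrier median over the non-carrier median; the function name `percent_upregulation` and all TPM values are hypothetical.

```python
from statistics import median


def percent_upregulation(present_tpm, absent_tpm):
    """Percent change of median expression in duplication-present samples
    relative to duplication-absent samples (assumed definition)."""
    baseline = median(absent_tpm)
    if baseline <= 0:
        raise ValueError("median expression of duplication-absent samples must be positive")
    return (median(present_tpm) / baseline - 1.0) * 100.0


# Hypothetical TPM values for one gene in one tissue.
doubled = percent_upregulation([20.0, 22.0, 18.0], [10.0, 11.0, 9.0])
one_and_a_half = percent_upregulation([15.0, 16.0, 14.0], [10.0, 11.0, 9.0])
```

Under this definition, a doubling of median expression corresponds to the 100% expectation and a 1.5-fold change to the 50% expectation mentioned in the legend.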

Extended Data Fig. 6 The distribution of duplicates across different chromosomes and distribution of distances to HASs.

a) Chromosomal distribution of duplicates. We only included the major autosomes (Drosophila: 2L, 2R, 3L, 3R; human: 1–22) and the X chromosome in the calculation. The proportion of annotated coding genes on each chromosome in the reference genome is used as the control (BG or background). The P value was calculated with the one-sided proportion test. b) Distances to HASs. The median distance (28,687 bp) of 8 complete duplicates to the closest HAS is highlighted with a green line. We generated 10,000 random samples and calculated the empirical P value as the proportion of random samples that showed a shorter median distance. The blue line indicates the median value of the random samples. The two median values are also shown.
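The resampling in panel b can be sketched as below. How the random samples were drawn is not stated in the legend, so this sketch assumes uniformly random positions on a single chromosome of hypothetical length, treats each duplicate and each HAS as a single coordinate, and uses invented function names; it is an illustration of the idea rather than the procedure used in the study.

```python
import random
from bisect import bisect_left
from statistics import median


def nearest_distance(pos, sorted_sites):
    """Distance from pos to the closest coordinate in a sorted list of sites."""
    i = bisect_left(sorted_sites, pos)
    candidates = []
    if i < len(sorted_sites):
        candidates.append(sorted_sites[i] - pos)
    if i > 0:
        candidates.append(pos - sorted_sites[i - 1])
    return min(candidates)


def empirical_p_shorter(observed_positions, sites, chrom_length,
                        n_samples=10_000, seed=0):
    """Observed median distance and the fraction of random samples with a shorter median.

    Assumption: each random sample holds as many uniformly drawn positions
    as there are observed loci.
    """
    sites = sorted(sites)
    rng = random.Random(seed)
    observed = median(nearest_distance(p, sites) for p in observed_positions)
    k = len(observed_positions)
    shorter = 0
    for _ in range(n_samples):
        sample = [rng.randrange(chrom_length) for _ in range(k)]
        if median(nearest_distance(p, sites) for p in sample) < observed:
            shorter += 1
    return observed, shorter / n_samples


# Hypothetical coordinates: loci sitting exactly on sites give a median distance of 0.
obs_median, p_value = empirical_p_shorter(
    [1000, 5000, 9000], [1000, 5000, 9000], chrom_length=10_000, n_samples=1000)
print(obs_median, p_value)
```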

Extended Data Fig. 7 Five fusion transcripts supported by PacBio reads and example of frameshift in KANSL1-ARL17A.

a) CG2818-CG31955 fusion. b) RpII215-CG11697-Kmn1 fusion. c) 5’-Intergenic duplication of PpN58A. d) CG4069-Pmm2-sowah fusion. e) Subreads supporting three new introns of Sherpa-CG2469. f) KANSL1-ARL17A fusion. For Panels A to E, five out of 11 chimeric genes assembled in the testis samples of RAL-379 were confirmed with PacBio data. For Panels A to D, only the longest assembled contig is shown for simplicity. In Panels A to E, the conventions follow those in Fig. 4c. Similar to Fig. 4c, CCS reads sometimes encode novel exon/intron structures. Only the intron in the CG2818-CG31955 fusion transcript harbours the standard splicing signal (GT-AG, Panel A). Six subreads in Panel E consistently exhibit three novel introns, as shown in Fig. 4c. Note that different CCS reads have different sequence qualities, as determined by the number of low-quality supporting subreads (see Panel E). For instance, CCS-2 has a higher quality than CCS-1 in Panel B. In Panel F, the duplicated region is framed with dashed lines. The fusion transcript includes a 2-bp intronic sequence (the splicing site, ‘gt’), which causes the frameshift. Two neighbouring codons are also shown.

Extended Data Fig. 8 Sequence features of 5’-3’ fusion genes.

CDS refers to coding sequence. Whether a frameshift occurs was inferred based on the codons adjacent to the breakpoints. Domains are shown by blue (the leading gene) or red (the lagging gene) rectangles, and zigzags show mid-domain breaks. Note: the fused region of the leading gene for CG17387-ctrip, that of the lagging gene for Sherpa-CG2469 and both regions for TFDP2-XRN1 do not harbour annotated domains. The lengths of the rectangles are roughly proportional to the relative domain size within each gene. ‘IntronU’ and ‘intron’ refer to introns located within UTRs and introns within coding regions, respectively. For Prosbeta5R2-CG5681, Prosbeta5R2 was completely duplicated, whereas the 3’ part of CG5681 was duplicated. After fusion, Prosbeta5R2 harbours a longer 3’ UTR by incorporating a CG5681-derived sequence.

Extended Data Fig. 9 Sequence and transcription features of internally duplicated genes.

a) Partial duplication of the 5’ UTR in CG3409. CG3409 encodes different 5’ UTRs, one of which is duplicated, generating a longer 5’ UTR. Because this 5’ exon is alternative, the original coding sequence is presumably not affected. b) Post-duplication intron retention of TRAF3IP3. According to the duplication boundaries (highlighted in light blue), two contigs indicate inclusive transcripts in which exons and flanking introns are simultaneously transcribed. c) Out-of-frame transcript of DENND5A. The duplicated region is framed with dashed line boxes. Codons are separated with commas, and amino acids are shown accordingly. The incompatibility of codon phases causes the frameshift. d) Read depth-based quantification of isoforms (see also Methods). Two examples (CG9663, SPG11) are shown. Given that there are only 10 data points for the duplication-present lines of SPG11, we show all individual values as grey dots. For each gene, we calculated the ratio of the read depth (duplicated region versus flanking region) in lines with or without duplication. For these two examples, the ratio found for lines with duplications was significantly (one-sided Wilcoxon signed rank test P ≤ 0.05) higher than that in lines without duplications, supporting the presence of inclusive isoforms. According to the median ratios, we further calculated the relative expression ratio of inclusive and exclusive transcripts (Fig. 5f). e) qRT–PCR-based quantification of isoforms in flies. Whole-body samples of duplication-absent lines were used as controls. Primers targeting inclusive transcripts did not show signals in the duplication-absent lines, and primers targeting both inclusive and exclusive transcripts generally generated weaker signals than their counterparts in the duplication-present lines. The error bars represent the standard deviation based on three technical replicates.
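The phase incompatibility noted for panel c can be illustrated with a toy Python sketch: a tandem duplication inside a coding sequence preserves the downstream reading frame only when the duplicated coding length is a multiple of three. The sequence, coordinates and function names below are hypothetical, and the sketch works on an already spliced coding sequence, so it is not the inference procedure used for DENND5A.

```python
def duplicate_segment(cds, start, end):
    """Return the coding sequence with cds[start:end] repeated in tandem."""
    return cds[:end] + cds[start:end] + cds[end:]


def keeps_reading_frame(start, end):
    """A tandem duplication keeps the downstream frame if it adds a multiple of 3 bases."""
    return (end - start) % 3 == 0


def codons(cds):
    """Split a sequence into complete codons, dropping any trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


# Hypothetical 6-codon coding sequence ending in the stop codon TAA.
cds = "ATGGCCAAAGGGTTTTAA"

in_frame = duplicate_segment(cds, 3, 9)      # duplicates 6 bases (two whole codons)
out_of_frame = duplicate_segment(cds, 3, 7)  # duplicates 4 bases

print(",".join(codons(in_frame)))
print(",".join(codons(out_of_frame)))
```

In the 6-base case the original stop codon is still read in frame, whereas in the 4-base case every codon downstream of the duplication changes, mirroring the comma-separated codon display in the panel.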

Extended Data Fig. 10 The frequency distribution of duplicates classified by the median size and proportions of duplicates across different allele frequency groups.

a) The upper and lower panels show all duplications and the duplications spanning a single gene, respectively. For each type of duplicate, we divided the duplicates into large (L) and small (S) groups according to the median size. The Wilcoxon signed rank test was used to compare the median frequencies in these two groups. Small duplicates generally show higher frequencies than large duplicates in humans. Moreover, despite the small sample size of the fly dataset, larger duplications also seem to show lower allele frequencies. b) Allele frequency spectra of complete, partial and intronic duplicates in addition to PTCs and synonymous substitutions in Drosophila (left panel) and humans (right panel). Only duplications spanning a single gene were plotted. c) Proportion of complete, partial and intronic duplicates within different frequency groups. In the left panel, all duplicate loci were plotted with six-line allele frequency data. In the middle and right panels, only duplications spanning a single gene were plotted. The figure conventions of Panels B and C follow those of Fig. 6a and b, respectively.

Cite this article

Zhang, D., Leng, L., Chen, C. et al. Dosage sensitivity and exon shuffling shape the landscape of polymorphic duplicates in Drosophila and humans. Nat Ecol Evol 6, 273–287 (2022). https://doi.org/10.1038/s41559-021-01614-w
