Abstract
Despite polymorphic duplicate genes' importance for the early stages of duplicate gene evolution, they are less studied than old gene duplicates. Two essential questions thus remain poorly addressed: how does dosage sensitivity, imposed by stoichiometry in protein complexes or by X chromosome dosage compensation, affect the emergence of complete duplicate genes? Do introns facilitate intergenic and intragenic chimaerism as predicted by the theory of exon shuffling? Here, we analysed new data for Drosophila and public data for humans, to characterize polymorphic duplicate genes with respect to dosage, exon–intron structures and allele frequencies. We found that complete duplicate genes are under dosage constraint induced by protein stoichiometry but potentially tolerated by X chromosome dosage compensation. We also found that in the intron-rich human genome, gene fusions and intragenic duplications extensively use intronic breakpoints generating in-frame proteins, in accordance with the theory of exon shuffling. Finally, we found that only a small proportion of complete or partial duplicates are at high frequencies, indicating the deleterious nature of dosage or gene structural changes. Altogether, we demonstrate how mechanistic factors including dosage sensitivity and exon–intron structure shape the short-term functional consequences of gene duplication.
Data availability
We released both the raw and the processed data. First, the resequencing, RNA-seq and Iso-seq data generated in this study were submitted concurrently to the NCBI BioProject database under accession numbers PRJNA681089, PRJNA681417 and PRJNA693662, and to the Genome Sequence Archive at the National Genomics Data Center (NGDC, part of the China National Center for Bioinformation) under accession numbers PRJCA004186, PRJCA001789 and PRJCA004319. Second, the processed fly data (duplication loci, DNA read alignment depth, RNA-seq alignments, assembled chimaeras, PacBio CCS reads and RT–PCR Sanger sequencing data) and human data (duplication loci and assembled chimaeras) were uploaded to the UCSC Genome Browser as public sessions (http://genome.ucsc.edu/cgi-bin/hgPublicSessions) named ‘Drosophila Duplication’ and ‘Human Duplication’, respectively. To make these tracks as useful as possible: (1) each duplication ID refers to Supplementary Table 2; (2) clicking on a duplication ID shows the lines or individuals harbouring that duplication; and (3) assembled chimaeras were split at their breakpoints and aligned separately to the reference genome, as in Fig. 4c. Finally, the assembled contigs and the RT–PCR Sanger sequencing files were deposited at https://github.com/Zhanglab-IOZ/Polymorphic-Duplication and https://sandbox.zenodo.org/record/946570.
Code availability
The code for split-read-based duplication calling was submitted to: https://github.com/Zhanglab-IOZ/Polymorphic-Duplication. It is also archived at https://sandbox.zenodo.org/record/946570.
Acknowledgements
We thank the DGRP team and the GTEx team for generating and releasing the data. We thank M. Cardoso-Moreira for improving the writing. We thank M. Long, B. He, X. Lan, W. Qian, B. Guo and Zhang Laboratory (Institute of Zoology (IOZ), Chinese Academy of Sciences) members for the helpful discussions about the project. We thank Y. Fu, Y. Ding, G. Gao and Z. Wang for the computational help. We thank J. Wang and X. Huang for experimental help. This research was supported by the National Key R&D Program of China (2018YFC1406902 and 2019YFA0802600), the Chinese Academy of Sciences (ZDBS-LY-SM005, XBZG-ZDSYS-201913 and XDPB17), the National Natural Science Foundation of China (31771410 and 31970565) and the Open Research Program of the Chinese Institute for Brain Research; all were granted to Y.E.Z. The computing was jointly supported by the HPC Platform of BIG and that of the Scientific Information Centre of IOZ.
Author information
Contributions
Y.E.Z. conceived and designed the study. D.Z. performed the experimental analyses with the help of J.W.H. and Y.Q.Z. D.Z. and L.L. performed the computational analyses with the help of C.Y.C., H.Y., C.Y.M. and H.C. Y.E.Z., D.Z. and L.L. wrote the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review information
Nature Ecology & Evolution thanks Ben-Yang Liao and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Extended data
Extended Data Fig. 1 Phylogeny and features of 31 DGRP core lines.
We ordered 40 DGRP core lines from the Bloomington Stock Center but could maintain only 31 of them up to the time of sequencing. Lines infected with Wolbachia, lines harbouring inversions and lines with a fraction of segregating variants in at least one chromosome arm higher than 2% are marked in green, pink and blue, respectively. Among the remaining lines, six less related lines (marked in bold black) were subsequently used.
Extended Data Fig. 2 Heatmap of 108 tissue-specific genes.
The pheatmap package was used to generate this figure, in which expression was rescaled for each gene to highlight the tissue-specificity.
Extended Data Fig. 3 Size distribution of different types of duplicates and protein-coding genes.
a) All duplications (top) and duplications spanning a single gene (bottom). b) Length distribution of all duplications relative to the length of all protein-coding genes. The median values are shown for each group. P values were calculated with Wilcoxon signed rank tests.
Extended Data Fig. 4 Proportion distribution in the shuffling dataset.
a) Proportion of complete/partial/intronic duplicates. b) Proportion of six arrangements of partial duplicates. c) Proportion of intronic breakpoints. We performed 100 simulations. For each simulation, we calculated the proportion of complete, partial and intronic duplicates. For partial duplicates, we further calculated the proportion of six subtypes. For each subtype, we also calculated the proportion of intronic breakpoints. We summarized the 100 values as a boxplot. The median simulated values and the observed values are shown along each group. For each plot, we marked the observed value as a dark black line and report the empirical P value as the fraction of the 100 replicates in which the simulated value is more extreme than the observed value. For example, in the case of partial duplicates, the observed proportion is 58.9%, which is lower than the simulated proportion in only 12 out of 100 random replicates (top left panel); the corresponding P value is therefore 0.12.
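The empirical P value described in this legend can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `empirical_p`, the `tail` argument and the simulated values are hypothetical, chosen only so that the arithmetic reproduces the 12-out-of-100 example.

```python
def empirical_p(observed, simulated, tail="upper"):
    """Fraction of simulated values that are more extreme than the observed one.

    tail="upper": count simulated values greater than the observed value.
    tail="lower": count simulated values smaller than the observed value.
    """
    if not simulated:
        raise ValueError("at least one simulated value is required")
    if tail == "upper":
        hits = sum(1 for s in simulated if s > observed)
    elif tail == "lower":
        hits = sum(1 for s in simulated if s < observed)
    else:
        raise ValueError("tail must be 'upper' or 'lower'")
    return hits / len(simulated)


# Hypothetical replicates: 12 simulated proportions above the observed 58.9%
# and 88 below it (these numbers are made up for illustration).
simulated_proportions = [0.60] * 12 + [0.55] * 88
p_value = empirical_p(0.589, simulated_proportions, tail="upper")
print(p_value)
```

With 100 replicates the smallest non-zero P value that can be reported this way is 0.01, so the resolution of the test is set by the number of simulations.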
Extended Data Fig. 5 Tissue-level transcriptional fold-change distribution for complete duplicates.
The convention of this figure follows Fig. 3a. For each gene in each tissue, we calculated the fold-change from the median expression in duplication-present and duplication-absent lines or individuals. In flies, each tissue covers at least 19 transcribed (TPM > 1) genes. In humans, only the 11 tissues or cell lines covering at least 15 transcribed genes are shown. A Wilcoxon signed rank test was used to estimate the statistical significance relative to the expected 100% upregulation (top panel) and 50% upregulation (bottom panel). Despite some moderate fluctuation, the extent of upregulation generally does not deviate from the expectation (for 6 out of 7 tissues in flies, 11 out of 11 in humans). FFB, MFB and AG refer to female fat bodies, male fat bodies and accessory glands. Note that for AG, the deviation from 100% is significant simply because most of the data points are smaller than 100% (purple curve). By contrast, FFB is not significant despite the smaller median (48% versus 54%).
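As a rough illustration of the per-gene, per-tissue calculation, the following Python sketch converts median expression in the two groups into a percentage upregulation. Expressing the fold-change as (ratio − 1) × 100 is an assumption made here to match the 100% and 50% expectations in the legend; the function name and the TPM values are hypothetical.

```python
from statistics import median


def upregulation_percent(present_tpm, absent_tpm):
    """Percentage upregulation of the median expression in duplication-present
    samples relative to the median in duplication-absent samples."""
    baseline = median(absent_tpm)
    if baseline <= 0:
        raise ValueError("median expression without the duplication must be positive")
    return (median(present_tpm) / baseline - 1.0) * 100.0


# Hypothetical TPM values for one gene in one tissue.
fly_like = upregulation_percent([18.0, 20.0, 22.0], [9.0, 10.0, 11.0, 10.0])
human_like = upregulation_percent([15.0], [10.0])
print(fly_like, human_like)
```

Under this convention a doubling of the median expression corresponds to 100% upregulation and a 1.5-fold increase to 50%.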
Extended Data Fig. 6 The distribution of duplicates across different chromosomes and distribution of distances to HASs.
a) Chromosomal distribution of duplicates. We included only the major autosomes (Drosophila: 2L, 2R, 3L, 3R; human: 1–22) and the X chromosome in the calculation. The proportion of annotated coding genes on each chromosome in the reference genome is used as the control (BG, background). The P value was calculated with a one-sided proportion test. b) Distances to HASs. The median distance (28,687 bp) of 8 complete duplicates to the closest HAS is highlighted with a green line. We generated 10,000 random samples and calculated the empirical P value as the fraction of random samples that showed a shorter median distance. The blue line indicates the median value of the random samples. The two median values are also shown.
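The resampling test in panel b can be sketched in Python as follows. This is only an illustration under stated assumptions: the legend does not specify what the random samples are drawn from, so the sketch draws them from a hypothetical list of background gene positions, and it treats all positions as coordinates on a single chromosome. All names and coordinates are invented for the example.

```python
import bisect
import random
from statistics import median


def nearest_distance(position, sorted_sites):
    """Distance from a position to the closest site in a sorted list."""
    if not sorted_sites:
        raise ValueError("at least one site is required")
    i = bisect.bisect_left(sorted_sites, position)
    candidates = []
    if i < len(sorted_sites):
        candidates.append(sorted_sites[i] - position)
    if i > 0:
        candidates.append(position - sorted_sites[i - 1])
    return min(candidates)


def empirical_p_shorter(observed_positions, background_positions, sites,
                        n_samples=10_000, seed=0):
    """Return the observed median distance and the fraction of random samples
    whose median distance to the nearest site is shorter than observed."""
    sorted_sites = sorted(sites)
    observed = median(nearest_distance(p, sorted_sites) for p in observed_positions)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    k = len(observed_positions)
    shorter = 0
    for _ in range(n_samples):
        sample = rng.sample(background_positions, k)
        if median(nearest_distance(p, sorted_sites) for p in sample) < observed:
            shorter += 1
    return observed, shorter / n_samples


# Hypothetical coordinates on one chromosome.
sites = [50_000, 400_000, 900_000]
background = list(range(0, 1_000_000, 5_000))
duplicates = [45_000, 60_000, 390_000, 410_000, 880_000, 905_000, 70_000, 420_000]
obs_median, p = empirical_p_shorter(duplicates, background, sites, n_samples=1_000)
print(obs_median, p)
```

A small P value here would mean that few random gene sets lie as close to the sites as the observed duplicates do.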
Extended Data Fig. 7 Five fusion transcripts supported by PacBio reads and example of frameshift in KANSL1-ARL17A.
a) CG2818-CG31955 fusion. b) RpII215-CG11697-Kmn1 fusion. c) 5’-Intergenic duplication of PpN58A. d) CG4069-Pmm2-sowah fusion. e) Subreads supporting three new introns of Sherpa-CG2469. f) KANSL1-ARL17A fusion. For Panels A to E, five out of 11 chimeric genes assembled in the testis samples of RAL-379 were confirmed with PacBio data. For Panels A to D, only the longest assembled contig is shown for simplicity. In Panels A to E, the conventions follow those in Fig. 4c. Similar to Fig. 4c, CCS reads sometimes encode novel exon/intron structures. Only the intron in the CG2818-CG31955 fusion transcript harbours the standard splicing signal (GT-AG, Panel A). Six subreads in Panel E consistently exhibit three novel introns, as shown in Fig. 4c. Note that different CCS reads have different sequence qualities, as determined by the number of low-quality supporting subreads (for example, Panel E). For example, CCS-2 has a higher quality than CCS-1 in Panel B. In Panel F, the duplicated region is framed with dashed lines. The fusion transcript includes a 2-bp intronic sequence (the splicing site, ‘gt’), which causes the frameshift. Two neighbouring codons are also shown.
Extended Data Fig. 8 Sequence features of 5’-3’ fusion genes.
CDS refers to coding sequence. Whether a frameshift occurs was inferred based on the codons adjacent to the breakpoints. Domains are shown by blue (the leading gene) or red (the lagging gene) rectangles, and zigzags show mid-domain breaks. Note: the fused region of the leading gene for CG17387-ctrip, that of the lagging gene for Sherpa-CG2469 and both regions for TFDP2-XRN1 do not harbour annotated domains. The lengths of the rectangles are roughly proportional to the relative domain size within each gene. ‘IntronU’ and ‘intron’ refer to introns located within UTRs and introns within coding regions, respectively. For Prosbeta5R2-CG5681, Prosbeta5R2 was completely duplicated, whereas the 3’ part of CG5681 was duplicated. After fusion, Prosbeta5R2 harbours a longer 3’ UTR by incorporating a CG5681-derived sequence.
Extended Data Fig. 9 Sequence and transcription features of internally duplicated genes.
a) Partial duplication of the 5’ UTR in CG3409. CG3409 encodes different 5’ UTRs, one of which is duplicated, generating a longer 5’ UTR. Because this 5’ exon is alternative, the original coding sequence is presumably not affected. b) Post-duplication intron retention of TRAF3IP3. According to the duplication boundaries (highlighted in light blue), two contigs indicate inclusive transcripts in which exons and flanking introns are transcribed together. c) Out-of-frame transcript of DENND5A. The duplicated region is framed with dashed line boxes. Codons are separated with commas, and amino acids are shown accordingly. The incompatibility of codon phases causes the frameshift. d) Read depth-based quantification of isoforms (see also Methods). Two examples (CG9663, SPG11) are shown. Given that there are only 10 data points for the duplication-present lines of SPG11, we show all individual values as grey dots. For each gene, we calculated the ratio of the read depth (duplicated region versus flanking region) in lines with or without duplication. For these two examples, the ratio found for lines with duplications was significantly (one-sided Wilcoxon signed rank test P ≤ 0.05) higher than that in lines without duplications, supporting the presence of inclusive isoforms. According to the median ratios, we further calculated the relative expression ratio of inclusive and exclusive transcripts (Fig. 5f). e) qRT–PCR-based quantification of isoforms in flies. Whole-body samples of duplication-absent lines were used as controls. Primers targeting inclusive transcripts did not show signals in the duplication-absent lines, and primers targeting both inclusive and exclusive transcripts generally generated weaker signals than their counterparts in the duplication-present lines. The error bar represents the standard deviation based on three technical replicates.
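The read depth ratio used in panel d can be sketched as below. This is a minimal Python illustration, not the authors' pipeline: the function names and depth values are hypothetical, and the significance test and the conversion to inclusive versus exclusive expression described in the Methods are deliberately left out.

```python
from statistics import median


def depth_ratio(duplicated_depth, flanking_depth):
    """Read depth in the duplicated region relative to the flanking region."""
    if flanking_depth <= 0:
        raise ValueError("flanking depth must be positive")
    return duplicated_depth / flanking_depth


def median_depth_ratio(samples):
    """Median ratio over (duplicated_depth, flanking_depth) pairs, one per line."""
    return median(depth_ratio(dup, flank) for dup, flank in samples)


# Hypothetical RNA-seq depths for one gene: (duplicated region, flanking region).
with_duplication = [(30.0, 20.0), (45.0, 30.0), (28.0, 20.0)]
without_duplication = [(20.0, 20.0), (31.0, 30.0), (19.0, 20.0)]
print(median_depth_ratio(with_duplication), median_depth_ratio(without_duplication))
```

A median ratio that is clearly higher in the lines carrying the duplication is the pattern the legend interprets as support for inclusive isoforms.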
Extended Data Fig. 10 The frequency distribution of duplicates classified by the median size and proportions of duplicates across different allele frequency groups.
a) The upper and lower panels show all duplications and the duplications spanning a single gene, respectively. For each type of duplicate, we divided the duplicates into large (L) and small (S) groups according to the median size. The Wilcoxon signed rank test was used to compare the median frequencies in these two groups. Small duplicates generally show higher frequencies than large duplicates in humans. Moreover, despite the small sample size of the fly dataset, larger duplications also seem to show lower allele frequencies. b) Allele frequency spectra of complete, partial and intronic duplicates in addition to PTCs and synonymous substitutions in Drosophila (left panel) and humans (right panel). Only duplications spanning a single gene were plotted. c) Proportion of complete, partial and intronic duplicates within different frequency groups. In the left panel, all duplicate loci were plotted with six-line allele frequency data. In the middle and right panels, only duplications spanning a single gene were plotted. The figure conventions of Panels B and C follow those of Fig. 6a and b, respectively.
About this article
Cite this article
Zhang, D., Leng, L., Chen, C. et al. Dosage sensitivity and exon shuffling shape the landscape of polymorphic duplicates in Drosophila and humans. Nat Ecol Evol 6, 273–287 (2022). https://doi.org/10.1038/s41559-021-01614-w
DOI: https://doi.org/10.1038/s41559-021-01614-w