A high-quality bonobo genome refines the analysis of hominid evolution

The divergence of chimpanzee and bonobo provides one of the few examples of recent hominid speciation1,2. Here we describe a fully annotated, high-quality bonobo genome assembly, which was constructed without guidance from reference genomes by applying a multiplatform genomics approach. We generate a bonobo genome assembly in which more than 98% of genes are completely annotated and 99% of the gaps are closed, including the resolution of about half of the segmental duplications and almost all of the full-length mobile elements. We compare the bonobo genome to those of other great apes1,3–5 and identify more than 5,569 fixed structural variants that specifically distinguish the bonobo and chimpanzee lineages. We focus on genes that have been lost, changed in structure or expanded in the last few million years of bonobo evolution. We produce a high-resolution map of incomplete lineage sorting and estimate that around 5.1% of the human genome is genetically closer to chimpanzee or bonobo and that more than 36.5% of the genome shows incomplete lineage sorting if we consider a deeper phylogeny including gorilla and orangutan. We also show that 26% of the segments of incomplete lineage sorting between human and chimpanzee or human and bonobo are non-randomly distributed and that genes within these clustered segments show significant excess of amino acid replacement compared to the rest of the genome.

The bonobo or pygmy chimpanzee (Pan paniscus) and the common chimpanzee (Pan troglodytes) are among the most-recently diverged ape species (around 1.7 million years ago) 1,2. Both species represent the closest living species to humans and, therefore, offer the potential to pinpoint genetic changes that are also unique to humans. The first bonobo sequence, which was generated using short-read whole-genome sequencing 1, resulted in a genome assembly (panpan1.1) with more than 108,000 gaps in which the vast majority of segmental duplications were not incorporated and few structural variants were identified (Supplementary Table 1). As a result of the lower accuracy of early next-generation sequencing technology and the fragmentary nature of the original chimpanzee genome, large fractions of the genomes of great apes could not be compared and gene models were often incomplete 3-8. In the past few years, long-read genome-sequencing technologies have considerably enhanced our ability to generate contiguous, high-quality genomes in which most genes and common repeat elements are fully annotated 9. Here, we apply a multiplatform approach to produce a highly contiguous, accurate bonobo reference genome. Our analysis highlights the extent to and rapidity at which hominid genomes can differ and provides insights into incomplete lineage sorting (ILS) and its relevance to gene evolution and the genetic relationship among living hominids.
[...] maps and a clone-order framework using fluorescent in situ hybridization (FISH) of bacterial artificial chromosomes (BACs) 10 (Fig. 1). The Mhudiblu_PPA_v0 assembly assigns 74 Mb of new sequence to chromosomes, closing 99.5% of the original 108,095 gaps (Supplementary Table 5). This assembly has been annotated by NCBI and is available in the UCSC Genome Browser (panPan3, Methods, Supplementary Data and Extended Data Fig. 1).
We estimate the sequence accuracy of the bonobo assembly to be 99.97-99.99% (Supplementary Table 6 and Supplementary Data). The overall nucleotide divergence between chimpanzee and bonobo based on these new long-read assemblies is 0.421 ± 0.086% for autosomes and 0.311 ± 0.060% for the X chromosome (Supplementary Table 7). Using these new assemblies, we genotyped 27 previously sequenced great ape genomes, which resulted in slight adjustments in median effective population sizes for the great apes (Extended Data Fig. 2).
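As an aside on units, per-base accuracy is often quoted as a Phred-scaled quality value (QV). The text reports accuracy only as a percentage, so the following is a minimal sketch of the standard conversion, QV = -10 log10(1 - accuracy), and not the authors' pipeline; the function names are hypothetical.

```python
import math


def accuracy_to_qv(accuracy: float) -> float:
    """Convert per-base accuracy (a fraction in (0, 1)) to a Phred-scaled QV."""
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    error_rate = 1.0 - accuracy
    return -10.0 * math.log10(error_rate)


def qv_to_accuracy(qv: float) -> float:
    """Inverse conversion: Phred-scaled QV back to per-base accuracy."""
    return 1.0 - 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    # The reported accuracy range for the bonobo assembly.
    for acc in (0.9997, 0.9999):
        print(f"accuracy {acc:.2%} -> QV {accuracy_to_qv(acc):.1f}")
```

Under this convention, the reported range of 99.97-99.99% corresponds to roughly QV 35 to QV 40, that is, about one error every 3,300 to 10,000 bases.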

Gene annotation
We predict 22,366 full-length protein-coding genes and 9,066 noncoding genes using the NCBI Eukaryotic Genome Annotation Pipeline. We also generated 867,690 full-length bonobo cDNAs (Supplementary Table 8) and applied the Comparative Annotation Toolkit 11 to identify 20,478 protein-coding and 36,880 noncoding bonobo gene models; 99.5% of the protein-encoding models show no frameshift errors 12 and 38.4% of the protein-coding isoforms are now more complete. We identify 119 genes that have potential frameshifting insertions or deletions that disrupt the primary isoform relative to the human reference (GRCh38) (Supplementary Table 9). Respectively, 206 and 1,576 protein-coding genes are part of gene families that contracted or expanded in the bonobo genome compared to the human genome (Supplementary Tables 10, 11). We identify 65 putatively previously undescribed exons with support from full-length cDNA (Supplementary Tables 12-14), such as the protein-coding exon in ANAPC2, which is found in the bonobo but not in the chimpanzee sequence (Supplementary Fig. 2). Using other great ape genomes 13,14 and a genome-wide analysis from 20 bonobo and chimpanzee samples, we identified genes that showed an excess of amino acid replacement, balancing selection and potential selective sweeps (Tajima's D and SweepFinder2) 15.

Mobile element insertions
The number of full-length (retrotransposition-competent), lineage-specific long interspersed nuclear element-1 (L1) in the bonobo genome (413 chimpanzee-specific L1 elements (L1Pt)) is similar to that in the chimpanzee genome (383 L1Pt) and 15-25% greater than the number of elements in the human genome (330 human-specific L1 elements (L1Hs)). An analysis of Alu short interspersed nuclear element (SINE) repeats leads to a refined subfamily classification and we find that the number of bonobo-specific elements (n = 1,492) is nearly identical to that in the chimpanzee genome (n = 1,431). Pan lineages, therefore, show among the lowest rates of Alu insertions compared to the human genome (in which the rate has doubled) and the rhesus macaque genome (which shows a tenfold increased rate) (Extended Data Fig. 3). Although the bonobo genome shows a reduced genetic diversity of single-nucleotide variants 7,16 compared to the chimpanzee genome, we find that bonobo SINE-variable number tandem repeat (VNTR)-Alu (SVA) elements are more copy number polymorphic (45%) (Extended Data Fig. 3) compared to the chimpanzee genome (35%; P < 6.5 × 10⁻⁴). By contrast, the chimpanzee-specific endogenous retrovirus (PtERV1) shows an indistinguishable low rate of polymorphism for PtERV1 in both species (7% for bonobo and 9% for chimpanzee), which suggests relatively little activity since the divergence of Pan (Supplementary Data).

Segmental duplications
We identified 87.4 Mb of segmental duplications (≥1 kilobase (kb) and ≥90% identity) (Extended Data Fig. 3 [...] Table 24) and used these data to identify recent gene family expansions (Extended Data Fig. 4 and Supplementary Tables 25-31). We show, for example, that the eukaryotic translation initiation factor 4 subunit A3 (EIF4A3) gene family has expanded in both chimpanzee and bonobo genomes. There is evidence that five out of the six paralogues are expressed and encode a full-length open-reading frame (Fig. 2 and Extended Data Fig. 5). We estimate that the initial EIF4A3 gene duplication occurred in the ancestral lineage approximately 2.9 million years ago. It then subsequently expanded and experienced gene conversion events independently in the chimpanzee and bonobo lineages, creating five and six copies of the EIF4A3 gene family, respectively. Notably, some of the gene conversion signals correspond to a set of specific amino acid changes in the basic ancestral structure that are now common to only chimpanzee and bonobo (Fig. 2 and Extended Data Fig. 5).
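The size and identity thresholds that define a segmental duplication here (≥1 kb and ≥90% identity) can be illustrated with a small filter. This is only a sketch under stated assumptions: the `Alignment` record, its fields and the example coordinates are hypothetical, and the detection method actually used is the whole-genome analysis comparison described in the Methods.

```python
from dataclasses import dataclass
from typing import Iterable, List

MIN_LENGTH_BP = 1_000   # ">=1 kilobase" threshold from the text
MIN_IDENTITY = 0.90     # ">=90% identity" threshold from the text


@dataclass(frozen=True)
class Alignment:
    """Hypothetical pairwise self-alignment record (half-open coordinates)."""
    chrom: str
    start: int
    end: int
    identity: float  # fraction of matching bases, 0.0 to 1.0

    @property
    def length(self) -> int:
        return self.end - self.start


def is_segmental_duplication(aln: Alignment) -> bool:
    """Apply the two thresholds quoted in the text."""
    return aln.length >= MIN_LENGTH_BP and aln.identity >= MIN_IDENTITY


def filter_segmental_duplications(alignments: Iterable[Alignment]) -> List[Alignment]:
    """Keep only alignments that satisfy both thresholds."""
    return [aln for aln in alignments if is_segmental_duplication(aln)]


if __name__ == "__main__":
    # Made-up example records, not data from the assembly.
    candidates = [
        Alignment("chr17", 1_000, 6_000, 0.98),   # long and similar: kept
        Alignment("chr17", 9_000, 9_500, 0.99),   # too short: dropped
        Alignment("chr2", 20_000, 30_000, 0.85),  # too divergent: dropped
    ]
    kept = filter_segmental_duplications(candidates)
    print(len(kept), "of", len(candidates), "alignments pass")
```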

Structural variation and gene disruption
As part of the assembly curation, we validated nine larger inversions that distinguish human and bonobo karyotypes, created a FISH-based chromosomal backbone (Fig. 1) [...] Table 41) and found segmental duplications or long interspersed nuclear elements at the breakpoints of inversions in 82% and 86% of cases, respectively (Supplementary Table 40). We also compared the bonobo genome to the human, chimpanzee and gorilla genomes to identify deletions and insertions (>50 base pairs (bp)). We classify 15,786 insertions and 7,082 deletions as bonobo-specific and genotyped these in a population of great ape samples 7,16,20 to identify 3,604 fixed insertions and 1,965 fixed deletions, of which only a small fraction (2.66% or 148 out of 5,569) intersect with genic functional elements (Supplementary Tables 42-45). Bonobo-specific events that delete ENCODE regulatory elements 21 (n = 381), for example, are enriched in membrane-associated genes with extracellular domains whereas chimpanzee-specific events (n = 187) are associated with cadherin-related genes (Supplementary Table 46). Deletions (n = 1,040) shared between the chimpanzee and bonobo genomes show an enrichment of the loss of putative regulatory elements associated with post-synaptic genes (3.32 enrichment; P = 1.2 × 10⁻⁷) and pleckstrin homology-like domains (6.15 enrichment; P = 1.20 × 10⁻⁹). We validate 110 events that disrupt protein-coding genes by generating high-fidelity genomic sequencing for each of the great ape reference genomes and restricting to those events that could be genotyped in a population of genomes (Supplementary Data). As expected, many fixed gene-loss events occurred in genes that are tolerant to mutation, redundant duplicated genes or genes in which the event simply altered the structure of the protein.
For example, we validate a 25.7-kb gene loss of one of the keratin-associated genes (KRTAP19-6) associated with hair production in the ancestral lineage of chimpanzee and bonobo (Supplementary Fig. 8). In the bonobo lineage, we identify five fixed structural variants that affect protein-coding genes (Supplementary Table 47), but only two of these completely ablate the gene. For example, LYPD8, which encodes a secreted protein that prevents invasion of the colonic epithelium by Gram-negative bacteria, has been completely deleted by a 24.3-kb bonobo-specific deletion. Similarly, SAMD9 (SAMD family member 9) is a fixed gene loss in bonobo as a result of a 41.46-kb bonobo-specific deletion. The other three bonobo-specific fixed structural variant events in protein-coding regions all maintain the open-reading frame, including a 49-amino acid deletion of ADAR1, which encodes a protein that is critical for RNA editing and is implicated in human disease 22-24 (Extended Data Fig. 6).
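The distinction between fixed and polymorphic lineage-specific variants in this section rests on population genotypes. The sketch below shows one plausible way to make that call; the criteria (every genotyped bonobo homozygous for the variant, no other species carrying it), the function name and the genotype encoding are assumptions made for illustration and are not taken from the paper, which used Paragraph for genotyping.

```python
from typing import Mapping, Optional, Sequence

# Genotypes are encoded as alternate-allele counts: 0, 1 or 2; None = no call.
Genotypes = Mapping[str, Sequence[Optional[int]]]


def classify_variant(genotypes: Genotypes, target: str = "bonobo") -> str:
    """Classify a structural variant relative to a target lineage.

    Assumed (hypothetical) rules:
      - "not_genotyped": no usable calls in the target species
      - "shared":        any non-target sample carries the variant
      - "fixed":         every called target sample is homozygous (2)
      - "absent":        no called target sample carries the variant
      - "polymorphic":   anything in between
    """
    target_calls = [g for g in genotypes.get(target, ()) if g is not None]
    other_calls = [
        g
        for species, calls in genotypes.items()
        if species != target
        for g in calls
        if g is not None
    ]
    if not target_calls:
        return "not_genotyped"
    if any(g > 0 for g in other_calls):
        return "shared"
    if all(g == 2 for g in target_calls):
        return "fixed"
    if all(g == 0 for g in target_calls):
        return "absent"
    return "polymorphic"


if __name__ == "__main__":
    example = {"bonobo": [2, 2, None, 2], "chimpanzee": [0, 0], "gorilla": [0]}
    print(classify_variant(example))

    # Arithmetic check of the counts reported in the text.
    fixed_total = 3_604 + 1_965
    print(fixed_total, f"{148 / fixed_total:.2%}")
```

The final lines only re-derive numbers already stated in the text: 3,604 fixed insertions plus 1,965 fixed deletions give 5,569 events, of which 148 is 2.66%.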

A comparison of ILS in hominids
The higher quality and more contiguous nature of the bonobo genome provide an opportunity to generate a higher-resolution ILS map. In comparison to the original bonobo assembly in which only around 800 Mb (27%) could be analysed, it is now possible to align approximately 76% of the genome in a four-way ape genome alignment (2,357 Mb within 10-kb windows) (Supplementary Table 48) owing to long-read genome assemblies 14. We performed a genome-wide phylogenetic window-based analysis to systematically identify regions that are inconsistent with the species tree and classified these as human-bonobo and human-chimpanzee ILS topologies (Fig. 3). We predict that 5.07% of the human genome is genetically closer to chimpanzee or bonobo (Table 1); 2.52% of the human genome is more closely related to the bonobo genome (human-bonobo ILS segments) than the chimpanzee genome whereas 2.55% of the human genome is more closely related to the chimpanzee genome (human-chimpanzee ILS) than the bonobo genome (Fig. 3a). This proportion of ILS nearly doubles previous estimates (3.3%) 1 (Supplementary Table 1). Consistent with previous observations 1, the largest ILS segments are biased (around 1.8-fold) to intergenic regions, depleted for genes (>35%) and are particularly enriched in L1 content. Notably, the distribution of ILS segments is highly non-random based on simulation experiments. We specifically measured the distance between ILS segments (see below) and identified a subset (around 26%) of sites that are significantly more clustered than expected by chance (Fig. 3b). We focused specifically on protein-coding exons based on the human RefSeq annotation 25 and identified 1,446 exons that mapped to ILS topologies (713 exons to a human-bonobo topology and 733 exons to a human-chimpanzee topology) (Supplementary Table 49). As a whole, genes corresponding to these ILS exons are significantly enriched in both glycoprotein function (P = 1.30 × 10⁻¹⁴ for human-bonobo and P = 5.60 × 10⁻¹¹ for human-chimpanzee) and calcium-binding epidermal growth factor (EGF) domain function (P = 4.40 × 10⁻¹² for human-bonobo and P = 9.40 × 10⁻⁷ for human-chimpanzee) (Supplementary Table 50). We considered multiple occurrences in the same gene and identified 84 genes with at least two exons under ILS (Supplementary Table 51) with some enrichment in photoreceptor activity (P = 1.6 × 10⁻⁴) (Supplementary Table 51 and Supplementary Fig. 9) as well as EGF-like (P = 1.9 × 10⁻⁶) and transmembrane (P = 2.4 × 10⁻³) functions. Overall, we observe a significant excess of amino acid replacement (dN/dS) for all 1,446 ILS exons compared to non-ILS exons (P = 0.0048 for human-bonobo, P = 0.039 for human-chimpanzee) (Fig. 3c), which is consistent with either the action of relaxed selection or positive selection. Exons mapping to the clustered ILS segments show greater dN/dS with respect to exons in the non-clustered ILS segments, which suggests that these clustered ILS segments are contributing disproportionately to accelerated amino acid evolution in the hominid genome.

Fig. 2 | a, Multiple sequence alignment shows EIF4A3 amino acid differences between the human, Mhudiblu_PPA and chimpanzee assembled paralogues, and sequences of other great apes. A polymorphic 18-bp motif VNTR is located at the 5′ UTR of nonhuman primate EIF4A3 and accounts for most of the differences between various isoforms. A phylogenetic tree is built from neutral sequences of EIF4A3 paralogues using Bayesian phylogenetic inference. This analysis is conducted using BEAST2 software. Numbers on each major node denote estimated divergence time. Ma, million years ago. The blue error bar on each node indicates the 95% confidence interval of the age estimation. Bayesian posterior probabilities are reported using asterisks for nodes with posterior probability >99%. b, FISH on metaphase chromosomes and interphase nuclei with human probe WI2-3271P14 confirms an EIF4A3 subtelomeric expansion of chromosome 17 in bonobo and chimpanzee relative to human, gorilla and orangutan.
We extended the ILS analysis (Supplementary Data) across 15 million years of hominid evolution through the inclusion of genome data from orangutan and gorilla. As expected, ILS estimates for the human genome increase to more than 36.5% (Extended Data Fig. 7 and Supplementary Table 52) similar to (albeit still greater than) previous estimates 3,14. We measured the inter-ILS distance and observed a consistent non-random pattern of clustered ILS for these deeper topologies with more ancient ILS showing an even greater proportion of clustered sites (Extended Data Fig. 7). Once again, we observe a significantly increased mean dN/dS in clustered human-chimpanzee and human-bonobo topologies (P < 2.2 × 10⁻¹⁶, mean = 0.366) as well as clustered orangutan-human and orangutan-gorilla-human topologies (P < 2.2 × 10⁻¹⁶, mean = 0.316) compared to the null distribution (Supplementary Fig. 10). A Gene Ontology analysis 26 of the genes that intersect these combined data confirms not only the most significant signals for immunity (for example, glycoprotein (P = 1.3 × 10⁻²⁵) and immunoglobulin-like fold/FN3 (P = 2.4 × 10⁻²⁰)), but also genes related to EGF signalling (P = 1.6 × 10⁻¹³), solute transporter function (for example, transmembrane region (P = 1.3 × 10⁻²⁵)) and, specifically, calcium transport (P = 3.7 × 10⁻⁸) (Supplementary Table 53). Although ILS regions, in general, show diversity patterns of single-nucleotide polymorphisms that are consistent with balancing selection, it is noteworthy that both clustered and non-clustered ILS exons show a significant excess of polymorphic gene-disruptive events that are consistent with the action of relaxed as well as balancing selection (Supplementary Fig. 11).
An examination of these gene-rich clustered ILS regions reveals a complex pattern of diverse ILS topologies that suggests deep coalescence operating across specific regions of the human genome as has previously been reported for the major histocompatibility complex 1,3 (Extended Data Fig. 8).
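The inter-ILS distance analysis used in this section compares the observed spacing of ILS segments with a simulated random expectation. The following is a minimal sketch of that idea for a single chromosome, assuming uniformly random placement of equal-length segments as the null model; the function names, the distance threshold and the toy coordinates are hypothetical, and the authors' simulation may differ in every detail.

```python
import random
from typing import List, Sequence, Tuple

Segment = Tuple[int, int]  # (start, end), half-open, on one chromosome


def inter_segment_distances(segments: Sequence[Segment]) -> List[int]:
    """Gaps between adjacent segments after sorting; overlaps count as 0."""
    ordered = sorted(segments)
    return [max(0, nxt[0] - cur[1]) for cur, nxt in zip(ordered, ordered[1:])]


def simulate_random_distances(
    n_segments: int, segment_length: int, chrom_length: int, rng: random.Random
) -> List[int]:
    """Null model (an assumption): segment starts drawn uniformly at random."""
    starts = [rng.randrange(0, chrom_length - segment_length) for _ in range(n_segments)]
    return inter_segment_distances([(s, s + segment_length) for s in starts])


def fraction_below(distances: Sequence[int], threshold: int) -> float:
    """Share of inter-segment distances shorter than the threshold."""
    if not distances:
        return 0.0
    return sum(d < threshold for d in distances) / len(distances)


if __name__ == "__main__":
    # Toy observed segments: two tight clusters on a 1-Mb chromosome.
    observed = [(10_000, 10_500), (11_000, 11_500), (12_000, 12_500),
                (800_000, 800_500), (801_000, 801_500)]
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    simulated = simulate_random_distances(len(observed), 500, 1_000_000, rng)
    threshold = 5_000  # hypothetical cut-off for calling two segments "clustered"
    print("observed :", fraction_below(inter_segment_distances(observed), threshold))
    print("simulated:", fraction_below(simulated, threshold))
```

An excess of short distances in the observed data relative to many such simulations is what would indicate clustering; a real analysis would repeat the simulation many times and work genome-wide.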

Discussion
High-quality hominid genomes are a critical resource for understanding the genetic differences that make us human as well as the diversification of the Pan lineage over the past two million years of evolution. The bonobo represents the last of the great ape genomes to be sequenced using long-read sequencing technology. Its sequence will facilitate more systematic genetic comparisons between human, chimpanzee, gorilla and orangutan without the limitations of technological differences in sequencing and assembly of the original reference 1,3-5,14. As a result, we now predict that a greater fraction (around 5.1%) of the human genome is genetically closer to chimpanzee or bonobo compared to previous studies (3.3%) 1. We estimate that more than 36.5% of the hominid genome shows ILS if we consider a deeper phylogeny that includes gorilla and orangutan. Notably, 26% of the ILS regions are clustered and exons that underlie these clustered ILS signals show elevated rates of amino acid replacement. These findings support a previous study in gorilla that showed a subtler correlation in which genes with higher dN/dS values are enriched in ILS segments 3. In that study, however, the authors explained the observation as a result of stronger purifying selection in non-ILS sites or background selection that reduced the effective population size and, as a result, led to a depletion of ILS. Our genome-wide exon analyses specifically show that only a subset of clustered ILS exons are driving this effect and that these genes are enriched in glycoprotein and EGF-like calcium signalling functions owing to the action of either relaxed selection or positive selection of genes in these pathways (Supplementary Data).

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-021-03519-x.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. © The Author(s) 2021
Table 1 (footnotes) | (G, ((B, H), C)) and (G, ((H, C), B)) represent two different ILS topologies. Intergenic/intragenic indicates the intergenic to intragenic ratio. a, Content is shown as a percentage; the GC, Alu, L1 and exon contents are based on the GRCh38 genome.

Methods
We sequenced and assembled the genome of a single female bonobo (Mhudiblu, also known as Mhudibluy, who was obtained from the San Diego Zoo, ISIS 601152, born 15 April 2001 and who was later transferred to the Wuppertal Zoo in Germany where she was referred to as Muhdeblu) using long-read PacBio RS II sequencing chemistry and the Falcon genome assembler. The assembly was error-corrected using Quiver 27, Pilon 28 and an in-house FreeBayes-based 29 insertion or deletion correction pipeline optimized to improve continuous long-read assemblies 14. We also generated Illumina whole-genome sequencing (WGS) data using the Illumina TruSeq PCR-Free library preparation kit. Genome assembly contigs were ordered and oriented into scaffolds using Bionano optical maps (Supplementary Table 54 and Supplementary Data) (HybridScaffolds suite, Bionano Genomics Saphyr platform) and four-colour FISH of 324 BAC clones. Cell lines from chimpanzee, bonobo, gorilla and orangutan were obtained from Coriell (S006007) or from a collection developed by M. Rocchi; no approval from ethics committees was required for use of these established lines. We assigned each contig and scaffold into unique groups corresponding to individual chromosomal homologues using SaaRclust 30,31 while applying Strand-seq to detect inversions, assign orphan contigs and orient contigs 32,33. To estimate genome-wide sequence accuracy, we applied Merqury 34 using Illumina WGS data. We also generated a bonobo large-insert BAC library (VMRC74) and selected at random 17 clones for complete PacBio insert sequencing 35. The Comparative Annotation Toolkit (CAT) 11 was used for genome annotation using human GENCODE v.33 and RNA-sequencing data.
We also generated more than 860,000 full-length non-chimeric transcripts from full-length isoform sequencing (Iso-Seq) data generated from induced pluripotent stem cell and derived neuronal progenitor cell lines 36 from bonobo sample AG05253 and we searched for gene structures split over multiple contigs (Supplementary Table 55). Repeat content of the assembled genome was analysed using RepeatMasker (RepeatMasker-Open-4.1.0) and the Dfam3 repeat library. We assigned lineage-specific Alu and full-length long interspersed nuclear element, SVA_D and PtERV elements to subfamilies by applying COSEG (http://www.repeatmasker.org/COSEGDownload.html) to determine the lineage-specific subfamily composition. For cross-species analysis of mobile element insertions (MEIs), we performed liftOver on the basis of the chains built from the Cactus whole-genome alignments generated during CAT annotation. For cross-assembly analyses of bonobo MEIs and a specific subset of other analyses (Supplementary Data), we used Bowtie 2 to map MEI flanking sequences between genomes. We estimated the duplication content in the bonobo assembly, applying the whole-genome analysis comparison method 37, and targeted collapsed duplications for assembly using Segmental Duplication Assembler 19. Insertions and deletions were detected in bonobo, chimpanzee and gorilla using PBSV, Sniffles 38 and Smartie-sv 14 and genotyped using Paragraph 39 against a panel of 27 Illumina WGS genomes. We searched for evidence of ILS among the chimpanzee, gorilla and human lineages, applying Prank (v.140110) to construct multiple sequence alignments and using the ete3 module to identify segments and exons under ILS (Supplementary Table 56).
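The last step above, labelling each alignment window by its tree topology, can be sketched without any phylogenetics library. The example below is not the authors' ete3-based code: it assumes each window already has a topology-only Newick string (no branch lengths) rooted on gorilla, and it identifies the single sister pair ("cherry") among human, chimpanzee and bonobo with a regular expression. The function names, labels and toy trees are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Map the sister pair found in a gorilla-rooted tree to a topology label.
LABELS = {
    frozenset({"bonobo", "chimpanzee"}): "species_tree",
    frozenset({"human", "bonobo"}): "human-bonobo_ILS",
    frozenset({"human", "chimpanzee"}): "human-chimpanzee_ILS",
}

# An innermost pair of leaf names, e.g. "(bonobo,human)".
CHERRY = re.compile(r"\(\s*([A-Za-z_]+)\s*,\s*([A-Za-z_]+)\s*\)")


def classify_window(newick: str) -> str:
    """Label one window from a topology-only Newick string rooted on gorilla."""
    cherries = [frozenset(pair) for pair in CHERRY.findall(newick)]
    if len(cherries) != 1:
        return "other"  # unresolved or unexpected tree shape
    return LABELS.get(cherries[0], "other")


def topology_fractions(trees: Iterable[str]) -> Dict[str, float]:
    """Fraction of windows assigned to each label."""
    counts = Counter(classify_window(tree) for tree in trees)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


if __name__ == "__main__":
    # Toy windows, not real data.
    windows = [
        "(gorilla,((bonobo,chimpanzee),human));",
        "(gorilla,((bonobo,chimpanzee),human));",
        "(gorilla,((bonobo,human),chimpanzee));",
        "(gorilla,((human,chimpanzee),bonobo));",
    ]
    for label, fraction in sorted(topology_fractions(windows).items()):
        print(f"{label}: {fraction:.0%}")
```

Summing the two ILS labels over all windows is what yields a combined figure such as the 5.07% reported in the main text (2.52% human-bonobo plus 2.55% human-chimpanzee). A production analysis would use a proper tree library and would also require statistical support for each window's topology.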
For consistency, NCBI reference genome nomenclature has been used throughout the manuscript and corresponds to the following UCSC IDs (NCBI/UCSC): panpan1.1/panPan2, Mhudiblu_PPA_v0/panPan3, Clint_PTRv2/panTro6, Kamilah_GGO_v0/gorGor6, Susie_PABv2/ponAbe3 and GRCh38/hg38 (details of the methods used are provided in the Supplementary Data).

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Data availability
The Mhudiblu_PPA_v0 (GCA_013052645. [...]
Fig. 7 | Hominid ILS. The distance between adjacent ILS segments (inter-ILS) (500-bp resolution) was calculated and the distribution was compared to a simulated expectation based on a random distribution. The analysis reveals a bimodal (and possibly an emerging trimodal) pattern in which a distinct subset of ILS segments are clustered (that is, clustered ILS sites). Four different topologies were considered. a, A (orangutan, (((bonobo, chimpanzee), gorilla), human)) ILS topology in which 31.58% of inter-ILS is clustered is shown. b, A (orangutan, ((bonobo, chimpanzee), (gorilla, human))) ILS topology in which 33.5% is clustered is shown. c, A (orangutan, (((bonobo, human), chimpanzee), gorilla)) ILS topology in which 8.14% is clustered is shown. d, A (orangutan, ((bonobo, (chimpanzee, human)), gorilla)) ILS topology in which 9.89% of sites is clustered is shown. e, An example of a cluster of human-bonobo (red triangles) and human-chimpanzee (blue triangles) ILS corresponding to a group of genes. A four-species alignment of one exon from EGF (exon 5) is shown with a nominal signal of positive selection.
Fig. 8 [...] (mean = 4.45 × 10⁻⁴) and chimpanzee (mean = 9.35 × 10⁻⁴). A region of reduced diversity (grey) is shown that corresponds to a segmental duplication in which single-nucleotide polymorphisms were excluded due to potential mismapping. d, Same as c but merged onto the same scale and highlighting five regions (red arrows) in which diversity is reduced in bonobo compared to chimpanzee. Three of these correspond to previously identified regions 1; however, they are not among the top 1% of genome candidates showing positive selection by Tajima's D and SweepFinder2 15. The overall diversity of single-nucleotide polymorphisms is reduced across the region in bonobo compared to chimpanzee.

nature research | reporting summary
April 2020

Data
Policy information about availability of data
All the data and accession codes have been reported in the manuscript.

Study description
We sequenced and assembled a new bonobo reference genome using a multiplatform approach. The genome is more contiguous and accurate, allowing more comprehensive sequence alignment. We discovered new species-specific structural variants, including gene family expansions and deletions in the ape lineage. We provide a more complete view of incomplete lineage sorting and its non-random clustering during ape genome evolution.

Research sample
We sequenced a bonobo (pygmy chimpanzee) immortalised cell line (Carbone #601152). The source of the cells was an EBV-transformed lymphoblast cell line from a single female bonobo, Mhudiblu. The pygmy chimpanzee was chosen because of its importance for inferring species-specific changes in both the human and chimpanzee lineages. Together with chimpanzees, bonobos are the great apes most closely related to humans. The sample we sequenced is representative of Pan paniscus.

Sampling strategy
No sample size calculation was performed. We were searching for genomics and transcriptomics similarities/differences between Pan paniscus and other great ape genomes. For this purpose, deep whole genome long-read sequencing with the Pacific Biosciences Sequel II platform was performed and variants were then genotyped on a population of samples to confirm fixed or polymorphic status.

Data collection
Sequencing data for assembly were collected using Pacific Biosciences Sequel II Instrument Control SW (v7.1 or v8.0), whereas cytogenetics data were generated using a Leica fluorescence microscope and Leica Application Suite X (v3.7).

Timing and spatial scale
No timing or spatial scale was applied.

Data exclusions
No data were specifically excluded

Reproducibility
Computational experiments are deterministic and are, therefore, reproducible. Despite this expected reproducibility, computational experiments were performed multiple times with different parameters and followed up with experimental validation. All attempts at replication were successful.