Main

Gibbons (Hylobatidae) are critically endangered1 small apes that inhabit the tropical forests of southeast Asia (Fig. 1) and belong to the superfamily Hominoidea along with great apes and humans. In the primate phylogeny, gibbons diverged between Old World monkeys and great apes, providing a unique perspective from which to study the origins of hominoid characteristics.

Figure 1: Geographic distribution of gibbon species used in the study.
figure 1

We sequenced two individuals from each gibbon genus and two different species (H. moloch and H. pileatus) for the genus Hylobates. The extant geographic localization for each genus is illustrated on the map. Individuals in the photos are the ones sequenced in this study. The asterisk symbol indicates a deceased animal.

PowerPoint slide

Gibbons have several distinctive traits, the most striking of which is the unusually high number of large-scale chromosomal rearrangements in comparison to the inferred ancestral ape karyotype2. The four gibbon genera (Nomascus, Hylobates, Hoolock and Symphalangus) occupy different regions of southeast Asia and bear distinctive karyotypes, with diploid chromosome numbers ranging from 38 to 52 (Fig. 1). Given the relatively recent differentiation of these genera (4–6 million years ago (Myr ago), this constitutes an extraordinarily fast rate of karyotype change.

In order to investigate the mechanisms behind the plasticity of the gibbon genome, understand the evolutionary relationships among the four extant gibbon genera and study the evolution of putatively functional sequences related to gibbon-specific adaptations, we sequenced and assembled the genome of a female northern white-cheeked gibbon (Nomascus leucogenys) named ‘Asia’. The reference assembly (Nleu1.0) provides on average 5.7-fold Sanger read coverage over 2.9 gigabase pairs (Gb) (Table 1 and Supplementary Table ST1.1). Our quality assessment (Extended Data Fig. 1) confirmed its equivalence to other Sanger sequence-based non-human primate draft assemblies (such as the orangutan or rhesus macaque3,4) (Supplementary Information section S1, Supplementary Data Files 1 and 2). We also obtained 15× whole-genome shotgun (WGS) short-read data (Illumina) for two individuals of each gibbon genus and high-coverage exome data (>60×) for two of the same individuals in order to derive error models for single nucleotide polymorphism (SNP) calls (Supplementary Information section S2; Supplementary Tables ST2.1–2.3).

Table 1 Gibbon assembly statistics

Gibbon–human synteny breakpoints

Nleu1.0 scaffolds were aligned against the human reference (GRCh37) to be ordered and oriented into 26 chromosomes (Nleu3.0) under extensive guidance by cytogenetic data. The reshuffled nature of the gibbon genome was especially evident when human–gibbon chromosome alignments were compared with those between human and great apes, rhesus macaque (Old World monkey) and marmoset (New World monkey) (Fig. 2a). This higher rate of reshuffling applied only to large-scale chromosomal rearrangements (>10 megabases (Mb)), whereas smaller-scale rearrangements (10–100 kilobases (kb)) were comparable with other species (Fig. 2b) (Supplementary Information section S1).

Figure 2: Analysis of gibbon–human synteny and breakpoints.
figure 2

a, Oxford plots for human chromosomes (y axis) vs. chimpanzee, gorilla, orangutan, gibbon, rhesus macaque and marmoset chromosomes (x axis). Each line represents a collinear block larger than 10 Mb. The gibbon genome displays a significantly larger number of large-scale rearrangements than all the other species. In the gorilla plot, chromosomes 4 and 19 stand out as the product of a reciprocal translocation between chromosomes syntenic to human chromosomes 5 and 17. b, The graph shows the number of collinear blocks in primate genomes with respect to the human genome. The number of collinear blocks is a proxy for the number of rearrangements and decreases as the size of the blocks becomes larger. The gibbon genome has undergone a greater number of large-scale rearrangements; however, the number of small-scale rearrangements is comparable with the other species. The extremely low number of large rearrangements in the gorilla genome (dotted green line) is a reflection of the use of the human genome as a template in the assembly process. c, Examples of gibbon–human synteny breakpoints. The first two are class I breakpoints (that is, base-pair resolution) originated through non-homology based mechanisms. NLE12_1 is the result of an inversion in human chromosome 1 and NLE18_6 is the result of a translocation between human chromosomes 16 and 5 with an untemplated insertion in the gibbon sequence shown in purple; in both cases, micro-homologies in the human sequences are shown in red. The last example (NLE9_4) is a class II breakpoint (3.2 kb) containing a mixture of repetitive sequences.

PowerPoint slide

We identified 96 gibbon–human synteny breakpoints in Nleu1.0 and classified them as to whether they could be defined at the base-pair level (class I, n = 42) or only narrowed to an interval due to greater complexity (class II, n = 54). As previously reported5, breakpoints were significantly depleted of genes (Supplementary Fig. SF5.2 and Supplementary Data File 3) and breakpoint intervals contained a mixture of repetitive sequences that inserted exclusively into the gibbon genome2,5,6 (Fig. 2c). To assess breakpoint segmental duplication content, we identified gibbon-specific segmental duplication using in silico methods followed by experimental validation (Extended Data Fig. 2, Supplementary Fig. SF3.1, Supplementary Information section S3 and Supplementary Data File 4). Of note, both gibbon-specific segmental duplication and gene family expansion analyses suggested the gibbon genome has not undergone a greater rate of duplication than other hominoids, further supporting a model in which accelerated evolution has been limited to gross chromosomal rearrangements (Supplementary Information section S6, Supplementary Fig. SF6.1).

Segmental duplication enrichment was the best predictor of gibbon–human synteny breakpoints, as shown through permutation analyses (P value < 0.0001); however, breakpoints were also enriched for Alu elements (Supplementary Table ST5.1; Supplementary Information section S5; Supplementary Fig. SF5.2). Although non-allelic homologous recombination between highly similar sequences can mediate large-scale rearrangements7, the majority of gibbon chromosomal breakpoints bore signatures of non-homology based mechanisms (Fig. 2c). These included the insertion of non-templated sequences (2–51 nucleotides (nt)) and/or the absence of identity, suggesting non-homologous end joining. The presence of micro-homologies (2–26 nt) in a small portion of the breakpoints (13/42) pointed to additional alternative mechanisms such as microhomology-mediated end joining8 or microhomology-mediated break-induced replication9. The origin of the complex structure of breakpoint intervals (class II) was less obvious and reinforced the observation that repeats have the tendency to accumulate at the breakpoints.

To explore the possibility that chromatin conformation, rather than sequence, might predispose regions to breakage, we investigated the relationship between gibbon breakpoints and CCCTC-binding factor (CTCF), an evolutionarily conserved protein with multiple functions, including mediating intra- and interchromosomal interactions10. We performed chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) of CTCF-bound DNA using lymphoblast cell lines established from eight gibbon individuals (Supplementary Information section S5). We observed an enrichment of gibbon–human breakpoints in CTCF-binding events (P value = 0.0028), which increased when we considered a 20 kb window centred around each breakpoint (P value of < 0.0001). Notably, this enrichment was maintained only for CTCF-binding events shared with other primates (human, orangutan and rhesus macaque)11 but not those specific to gibbon (P value = 0.0019) (Supplementary Fig. SF5.4).

Thus, gibbon–human breakpoints co-localized with distinct genomic features and epigenetic marks; however, as many of these features were shared with other primates, other factors unique to the gibbon lineage must have been present to trigger the increased frequency of chromosomal rearrangements.

LAVA insertions in the gibbon genome

The gibbon genome contains all previously described classes of transposable elements that are mostly also present in other primates. One exceptional addition is the LAVA element, a novel retrotransposon that emerged exclusively in gibbons12 and has a composite structure comprised of portions of other repeats (3′-L1-AluS-VNTR-Alu-like-5′) (Fig. 3a). Searches of Nleu1.0 retrieved 1,797 LAVA insertions, 1,256 of which were 3′ intact elements, many carrying signs of target-primed reverse transcription (TPRT)13. The distribution of 3′ intact LAVA elements uncovered a significant overlap with genes (Pearson chi-squared, P = 0.017) and Gene Ontology (GO) analyses using the database for annotation, visualization, and integrated discovery (DAVID)14 showed a significant functional enrichment exclusive to the ‘microtubule cytoskeleton’ category (false discovery rate = 0.031, P value = 0.001) (Supplementary Information section S7 and Supplementary Data File 6) (Extended Data Fig. 3). Additional analyses with meta-pathway database tools15,16 refined this enrichment to pathways related to chromosome segregation, including ‘establishment of sister chromatid cohesion’ and ‘mitotic metaphase and anaphase’ (Supplementary Table ST7.3). Genes with LAVA insertions include proteins that function as checkpoints for cell division and for spindle integrity/architecture (such as MAP4, CEP164 and BUB1B)17,18,19, participate in kinetochore assembly and attachment to the spindle (for example, MAD1L1 and CLASP2)20,21, and have a role in chromosome segregation during cell division (for example, KIFAP3 and KIF27)22 (Extended Data Table 1).

Figure 3: The LAVA element and evidence for LAVA-mediated early transcription termination.
figure 3

a, Schematic view of the LAVA element highlights the main components that originated from common repeats (L1, Alu, VNTR and Alu-like). Target-site duplications (TSDs) and the poly(A) tail are also indicated. b, Luciferase reporter constructs used to assay for LAVA-mediated early transcriptional termination (left panel) and results of the luciferase reporter assay (right panel) showing increased luciferase activity by 50% relative to the background for pmiRGlo_LA_F (*P = 0.0013) (see Supplementary Information section S7.8) n = 5, five biological replicates, from five independent transfections done for each experimental condition tested. The experiment shown was replicated twice in the laboratory. Statistics were carried out using a Student’s t-test (two sided), P values for all pairwise comparisons LA_F vs. LA_E, ΔPA vs. LA_F, and ΔPA vs. LA_E respectively (with 95% CI) were adjusted for multiple comparisons according to the Bonferroni method. Centre values show the average, error bars indicate standard deviation. c, A median-joining network showing the relationships among the 22 LAVA subfamilies generated by comparing the 3′ intact LAVA elements. Coloured circles represent subfamilies and their size is proportional to the number of elements in the subfamily (numbers inside each circle). Black dots represent hypothetical sequences connecting adjacent subfamilies. All possible relationships are shown. Branch lengths are not drawn to scale.

PowerPoint slide

Intragenic LAVA insertions were skewed toward introns (Pearson chi-squared, P = 0.0001) and were less frequent than expected when within <1 kb of the nearest exon junction (Extended Data Fig. 3). The majority (74%) of intronic LAVA elements were found in the antisense orientation. We speculated that intronic antisense LAVA insertions may cause early transcription termination by providing a polyadenylation site in the antisense orientation, as previously described for L1 elements23,24 (Extended Data Fig. 3). Indeed, we found 84.1% of the 3′-intact LAVA elements encoded a perfect polyadenylation signal at their 3′ end in antisense orientation.

To obtain experimental evidence that LAVA elements disrupt transcription, we performed a reporter assay in which the 3' end of a luciferase gene construct lacking a transcriptional termination site was fused to the 3'-terminal fragments of LAVA_E and LAVA_F elements, mimicking the arrangement observed in gibbon genes (Fig. 3b, left). Luciferase activity exceeding background level by 50% was observed from the LAVA_F reporter construct (Fig. 3b, right), indicating faithful termination of luciferase transcription. Furthermore, 3′ rapid amplification of cDNA ends (RACE) experiments confirmed that the transcription termination site had been supplied from the LAVA element (Extended Data Fig. 3). Thus antisense intronic LAVA insertions can cause early transcription termination with some variability possibly due to the genomic context of the polyadenylation site, which explained the difference between the two reporter constructs.

We also investigated LAVA induced early transcription termination in vivo by analyzing RNA-seq data generated for the gibbon named Asia (Supplementary Table ST2.4). Specifically, we looked for paired-end reads only partially aligning to an antisense LAVA element due to untemplated residues and then identified cases for which the presence of a poly(A) tail was preventing full-length alignment. This analysis revealed that elements from a variety of subfamilies have the potential to cause early transcription termination, including those identified for LAVA elements inserted in the microtubule cytoskeleton genes (for example, LAVA_B2R2, LAVA_C4B, LAVA_B1R2) (Extended Data Table 1). We observed that early transcription termination occurred at relatively low levels as we identified a significant number of read pairs indicative of normal transcription and splicing for LAVA-terminated genes (Supplementary Table ST7.5). This is to be expected, as full inactivation of many of these genes would be lethal. On the other hand, as alternative splicing and RNA pol II transcript termination/polyadenylation are tightly coupled processes, LAVA-mediated early transcription termination could also act by differently affecting distinct isoforms and/or influencing the ratio between isoforms. Finally, LAVA insertions may also affect gene expression by functioning as exon traps, as shown for SVA elements25. One putative example of an exon trapping event was identified for HORMAD2, a gene that monitors the formation of synapsis during crossover26 (Supplementary Information section S7, Supplementary Table ST7.6, Supplementary Fig. SF7.1–7.2).

As genome reshuffling began in the common ancestor of all extant gibbon species, LAVA insertions must have occurred in key genes before the four genera diverged. We experimentally confirmed the mode and tempo of all 23 LAVA insertions in genes from the microtubule cytoskeleton category using both site-specific PCR and in silico methods (Extended Data Figure 4) and found that most of the insertions (15/23) were shared by the four gibbon genera (Supplementary Data File 6). Eleven of the genes match the structural requirements for early transcription termination and five of them are also shared. These genes include MAP4, involved in spindle architecture and CEP164, a G2/M checkpoint gene whose inactivation results in an aberrant spindle during cell division18,19 (Extended Data Table 1).

The complex evolutionary history of gibbons

We explored the relationship between LAVA family expansion and evolution of the gibbon lineage and, through analyses of diagnostic mutations, identified 22 LAVA subfamilies (Fig. 3c). In addition, we tested for the presence or absence of 200 LAVA loci from among the evolutionarily youngest elements in each subfamily (Extended Data Fig. 4) across 17 unrelated gibbon individuals and found that 52% of loci were shared among all four genera, whereas 27% were Nomascus specific. The remaining LAVA insertions showed a variety of confounding phylogenetic relationships consistent with incomplete lineage sorting (ILS) of ancestral polymorphisms, perhaps as a result of a rapid radiation of gibbon genera (Supplementary Information section S7; Supplementary Table ST7.1–7.2). We used a maximum likelihood method27 to obtain age estimates for the 22 LAVA subfamilies. In the case of the two oldest subfamilies, LAVA_A1 and LAVA_A2, we obtained estimates of 18 Myr ago and 17 Myr ago, respectively (Supplementary Table ST7.3). A coalescent-based methodology implemented in the software G-PhosCS28 using Nleu1.0 estimated a gibbon–great ape population divergence time of 16.8 Myr ago (95% confidence intervals (CI): 15.9–17.6 Myr ago) assuming a split time with macaque of 29 Myr ago (Supplementary Information section S4). Hence, the LAVA element probably originated around the time of the divergence of gibbons from the ancestral great ape/human lineage.

The evolutionary history of the gibbon lineage and, in particular, the timing and order of splitting among the four genera, is still a subject of debate29. To address this issue, we generated medium coverage (mean 15×) WGS short read data for two individuals from each of the four genera, including two different Hylobates species (H. moloch and H. pileatus) (Supplementary Table ST2.1–2.2). Although phylogenetic analysis of assembled whole mitochondrial DNA genomes using BEAST30 strongly supported monophyletic groupings for each gibbon genus, the branching order of the four genera remained unresolved (Supplementary Fig. SF9.1–9.2; Supplementary Information S9).

Neighbour-joining trees constructed from pairwise sequence divergence, k, across 11,000 genic (200 base pairs (bp)) and 12,000 non-genic (1 kilobase (kb)) autosomal loci supported a supermatrix sequence topology of (((Siamang (SSY), Hoolock (HLE)), Nomascus (NLE)), (H. pileatus (HPL)), H. moloch (HMO)) (Fig. 4a); nevertheless, bootstrap confidence for the node separating NLE and Hylobates was low (52%). This topology was also the most frequently observed when constructing k-based unweighted pair group method with arithmetic mean (UPGMA) trees along the genome using non-overlapping 100-kb sliding windows. However, all 15 possible rooted topologies for the four genera were observed at considerable frequencies (Extended Data Fig. 5), consistent with the extensive ILS observed in the LAVA element analysis.

Figure 4: Gibbon phylogeny and demography.
figure 4

a, The three most frequently observed UPGMA gene trees (numbers at the top) constructed across the genome at 100-kb sliding windows and posterior probabilities (numbers at the bottom) for the same species topologies from a coalescent-based ABC analysis. The relatively low numbers observed suggest presence of substantial ILS amongst the gibbon genera. b, Parameters estimates describing gibbon population demography assuming an instant radiation for all four genera (left) and the most probable bifurcating species topology (right). Black, green and red numbers indicate divergence times and Ne as calculated by ABC, BEAST and G-PhoCS analysis, respectively (Supplementary Information section S9). c, PSMC analysis estimating changes in historical Ne. The large increase in Ne observed in our PSMC plot for SSY in recent times is probably exaggerated due to higher sequencing error and mapping biases in non-NLE samples (see details in Supplementary section S8). A generation time of 10 years45,46 was used to obtain a per generation mutation rate of 1 × 10−8 per year.

PowerPoint slide

In order to infer the most likely bifurcating species topology amongst the four genera while taking into account ILS, we used a novel coalescent-based ABC methodology using the autosomal non-genic and genic loci (Veeramah et al., in the press) (Supplementary Information section S8). The topology described above had the highest combined posterior probability, though support was relatively low (P (model) = 17%) and other topologies, including one with NLE and Hylobates interchanged as the most external taxa, had comparable probabilities (Fig. 4a).

The estimated internal branch lengths under the best species topology using our ABC framework and G-PhoCS were very short, supporting a rapid speciation process for the four gibbon genera (Fig 4b, right). Given this observation and uncertainty in the best topology, we also estimated parameters under an instantaneous speciation model (Fig. 4b, left). Assuming an overall autosomal mutation rate of 1 × 10−9 per site per year, we placed the beginning of the speciation process at 5 Myr ago under both models, with the two Hylobates species diverging 1.5 Myr ago.

Consistent with the ABC analysis, SSY and HLE share the largest number of alleles across the whole genome (Supplementary Table ST8.5). However, NLE and the two Hylobates samples are both significantly closer to SSY than HLE as assessed by the D-statistic31. This result could be explained by two independent gene flow events between SSY and both NLE and Hylobates. However, fertile intergenic hybrids have yet to be observed either in the wild or captivity32; an alternative explanation would be long-term population structure in the gibbon ancestral population. Both the ABC and G-PhoCS analyses suggest that the ancestral gibbon effective population size (Ne) was large (80,000–130,000), but neither of these frameworks can distinguish this from a structured ancestral population.

The coalescent-based analysis (Fig. 4a), along with estimates of genome-wide heterozygosity (Supplementary Fig. ST8.2), suggests a larger long-term Ne for both N. leucogenys and H. moloch compared to the other species. Analysis using the pairwise sequentially Markovian coalescent (PSMC) model33 indicates that these two species underwent an increase in Ne during the Late Pleistocene era (500–100 thousand years ago (kyr ago) followed by a subsequent decrease in Ne 100–50 kyr ago (Fig. 4c) (Supplementary Information section S8). Fluctuation in Ne could result from changes in the actual number of individuals in the population, changes in population structure, and/or variable gene flow.

Functional sequence evolution

Accelerated substitution rates are a hallmark of adaptive evolution, and genomic regions with excess lineage-specific substitutions have been found to have functional roles34. We identified 240 short (153 bp) median length) regions with accelerated substitution rates in the gibbon lineage (gibARs). We observed that gibARs were primarily intergenic (66%) and tended to co-localize near the same genes as LAVA elements (P value = 81 × 10−6; odds ratio of 2.74 (95% CI: 1.79–4.07)). Consistent with this finding, a GO enrichment test for genes within ± 100 kb of each gibAR (in comparison with background genes) revealed enrichment for the ‘chromosome organization’ category (Benjamini–Hochberg false discovery rate <5%) (Extended Data Fig. 6). Given evidence of functional roles gathered for human accelerated regions35, we speculate that the gibARs may create functional elements (for example, enhancers or protein-binding domains) to modulate the transcriptional effect of local LAVA insertions (Supplementary Information section S12 and Supplementary Data File 9).

We assessed the potential presence of positive selection in 13,638 human genes with one-to-one orthologues in gibbon using a branch-site likelihood ratio test36 (Supplementary Information section S10). One of the most striking features of gibbons is their use of brachiation (arboreal locomotion using only the arms). We uncovered evidence related to traits possibly associated with this adaptation such as the gibbon’s longer arms, more powerful shoulder flexors, rotator muscles and elbow flexors37. First, some genes whose functions relate to these anatomical specializations appear to have undergone positive selection in gibbons. They include TBX5 (P value = 0.00015), required for the development of all forelimb elements38; COL1A1 (pro-alpha1 chains of type I collagen) (P value = 3.39 × 10−11), the fibril-forming collagen that is the main protein of bones, tendons and teeth39; and CHRNA1 (acetylcholine receptor subunit alpha precursor) (P value = 0.00039), involved in skeletal muscle contraction40. These genes have not been identified as positively selected in other primates to date. We also observed that some genes involved in chondrogenesis (SNX19, ID2 and EXT1) were associated with gibARs. Finally, the chondroadherin gene (CHAD)41 coding for a cartilage matrix protein is specifically duplicated in all gibbon genera (Extended Data Fig. 2).

Discussion

Our sequencing, assembling and analysis of the gibbon genome has provided numerous insights into the accelerated evolution of the gibbon karyotype and identified genetic signatures related to gibbon biology. First, segmental duplications and repetitive sequences were the best predictors of gibbon–human breakpoints, although we excluded a causal role given the predominance of non-homology-based repair signatures. Furthermore, accelerated rearrangement was confined to large-scale chromosomal events, pointing to a mechanism responsible for causing gross chromosomal changes, rather than global genomic instability. This is in line with our hypothesis that the high rate of chromosomal rearrangements may have been due to LAVA-induced premature transcription termination of chromosome segregation genes. This effect may have occurred at a low enough level to be compatible with life but sufficient to increase the frequency of chromosome segregation errors. The link between erroneous chromosome segregation and increased chromosomal rearrangement has been recently demonstrated by others through in vitro experiments25,26.

The question remains how such a high number of chromosomal rearrangements could become fixed in such a relatively short time. One possibility is that a combination of geographic isolation and post-mating reproductive barriers accelerated the radiation of the four gibbon genera. Our estimates dated the lineage-splitting event to the Miocene–Pliocene transition, when major changes in the distribution of tropical and subtropical forests were caused by the elevation of the Yunnan plateau and rise in sea levels42,43. Furthermore, fluctuation in sea levels beginning in the Early Pliocene appears to have brought about cycles of forest fragmentation and amalgamation, leading to alternating range compression and expansion for many mammalian groups44.

Together, these results advance our knowledge of the unique traits of the small apes and highlight the complex evolutionary history of these species. Moreover, our analyses of the rearranged gibbon genome help to provide insight into the mechanisms of chromosome evolution as well as uncovering a new source of genome plasticity.

Methods Summary

Sanger-based whole-genome sequencing was performed as described for other species. The genome assembly was generated using the ARACHNE genome assembler assisted with alignment data from the human genome (Supplementary Information section S1). The source DNA for the sequencing was derived from a single female (Asia; studbook no. 0098, ISIS no. NLL605) housed at the Virginia Zoo in Norfolk, Virginia. Short-read libraries were constructed at the Oregon Health & Science University (OHSU) following standard Illumina protocols and sequenced on an Illumina HiSeq 2000. Analyses were performed with custom analysis pipelines. See Supplementary Information for additional information about the methods.