The koala, the only extant species of the marsupial family Phascolarctidae, is classified as ‘vulnerable’ due to habitat loss and widespread disease. We sequenced the koala genome, producing a complete and contiguous marsupial reference genome, including centromeres. We reveal that the koala’s ability to detoxify eucalypt foliage may be due to expansions within a cytochrome P450 gene family, and its ability to smell, taste and moderate ingestion of plant secondary metabolites may be due to expansions in the vomeronasal and taste receptors. We characterized novel lactation proteins that protect young in the pouch and annotated immune genes important for response to chlamydial disease. Historical demography showed a substantial population crash coincident with the decline of Australian megafauna, while contemporary populations had biogeographic boundaries and increased inbreeding in populations affected by historic translocations. We identified genetically diverse populations that require habitat corridors and the institution of translocation programs to aid the koala’s survival in the wild.
The koala is an iconic Australian marsupial, instantly recognizable by its round, humanoid face and distinctive body shape. Fossil evidence identifies as many as 15–20 species, following the divergence of koalas (Phascolarctidae) from terrestrial wombats (Vombatidae) 30–40 million years ago1,2 (Supplementary Fig. 1). The modern koala, Phascolarctos cinereus, which first appeared in the fossil record ~350,000 years ago, is the only extant species of the Phascolarctidae. Like other marsupials, koalas give birth to underdeveloped young. Birth occurs after just 35 d of gestation, with young lacking immune tissues or organs. Their immune system develops while they are in the pouch, meaning survival during early life depends on immunological protection provided by mothers’ milk.
A specialist arboreal folivore feeding almost exclusively from Eucalyptus spp., the koala has a diet that would be toxic or fatal to most other mammals3. Due to the low caloric content of this diet, the koala rests and sleeps up to 22 h a day4. A detailed understanding of the mechanisms by which koalas detoxify eucalyptus and protect their young in the pouch has been elusive, as there are no koala research colonies and access to milk and tissue samples is opportunistic. The genome enables unprecedented insights into the unique biology of the koala, without having to harm or disturb an animal of conservation concern.
The genome also enables a holistic, scientifically grounded approach to koala conservation. Australia has the highest mammal extinction record of any country during the Anthropocene5, and koala numbers have plummeted in northern parts of its range since European settlement of the continent6, but increased in southern sections of the range, notably in parts of Victoria and South Australia. The uneven response of koala populations throughout its range is one of the most difficult issues in its management7. The species was heavily exploited by a pelt trade (1870s to late 1920s), which harvested millions of animals6,8,9. Today, the threats are primarily due to loss and fragmentation of habitat, urbanization, climate change and disease. Current estimates put the number of koalas in Australia at only 329,000 (range 144,000–605,000), and a continuing decline is predicted6. Koalas present a complex conservation conundrum: in the north, causes of decline include ongoing habitat fragmentation, urbanization and disease. However, decline in the south has followed a different path10, with widespread, often sequential, translocations (1920–1990s) from a limited founder population, which has resulted in genetically bottlenecked populations that are overabundant to the point of starvation in some areas11. There are marked differences in the degree to which threats affect each population, thereby cautioning against one prescription for population recovery.
Adding to the complexity of koala conservation is the impact of disease, specifically koala retrovirus (KoRV) and Chlamydia. KoRV is thought to have arrived in Australia via a putative murine vector before cross-species transmission12,13. It is now prevalent in northern koalas and appears to be spreading to southern populations14. Some strains appear to be more virulent than others and are putatively associated with an increase in neoplastic disease15. Similarly, Chlamydia, which in some individuals causes severe symptoms yet in others remains asymptomatic, may have crossed the species barrier from introduced hosts such as domestic sheep and cattle following European settlement16. A complete koala genome offers insights into the species’ genetic susceptibility to these diseases, provides the genomic basis for innovative vaccines, and can underpin new conservation management solutions that incorporate the species’ population and genetic structure, such as facilitating gene flow via habitat connectivity or translocations.
Koalas have 16 chromosomes, differing from the ancestral marsupial 2n = 14 karyotype by a simple fission of ancestral chromosome 2 giving rise to koala chromosomes 4 and 717. We sequenced the complete genome using 57.3-fold PacBio long-read coverage, generating a 3.42 Gb reference assembly. The primary contigs from the FALCON assembly (representing homozygous regions of the genome) yielded genome version phaCin_unsw_v4.1. This comprised 3.19 Gb, including 1,906 contigs with an N50 of 11.6 Mb and the longest at 40.6 Mb. The heterozygous regions of the genome (representing the alternate contigs from the assembly) totaled 230 Mb, with an N50 of 48.8 kb (Table 1, Supplementary Tables 1–3 and Methods). Approximately 30-fold coverage of Illumina short reads was used to polish the assembly. BioNano optical maps plus additional conserved synteny information for marsupials were used for scaffolding18 to assemble long-read contigs into ‘virtual’ chromosome scaffolds (‘super-contigs’) (Supplementary Tables 4 and 5 and Supplementary Note). The largest super-contig spanned approximately half of koala chromosome 7 (Supplementary Fig. 2).
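The contig N50 values quoted above follow the usual definition: the contig length at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly size. A minimal sketch of that calculation is given below; the function name and the toy contig lengths are illustrative assumptions and are not taken from the koala assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    # Only reached when the input is empty.
    raise ValueError("no contig lengths supplied")


# Toy contig lengths (hypothetical, arbitrary units); not the koala assembly.
toy_contigs = [40, 30, 20, 10, 5, 5]
toy_n50 = n50(toy_contigs)
```

For the toy input the total is 110, so the running sum first reaches 55 at the second contig and the N50 is 30.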
Our long-read-based sequence presented the opportunity to identify and study centromeres, which are multi-megabase regions that are challenging to construct in eutherian (for example, human and mouse)19 genome assemblies due to intractable higher order arrays of satellites. Centromeres are smaller in marsupials than in eutherians, and as such are more amenable to analysis20. Chromatin immunoprecipitation and sequencing using antibodies to centromeric proteins (CENP-A and CREST)21 enabled the identification of scaffolds containing putative centromeric regions (Supplementary Fig. 3) and the characterization of known and new repeats, including composite elements within koala centromeric domains (Supplementary Tables 6–10) that lack the previously annotated retroelement, kangaroo endogenous retrovirus (KERV), found in some tammar wallaby centromeres22. Koala centromeres span a total of 2.6 Mb of the koala haploid genome, equivalent to an average of 300 kb of centromeric material per chromosome. Like those of other species with small centromeres19,20,23,24, koala centromeres lack higher order satellite arrays (Supplementary Tables 7–10). Among the newly identified repeats, some are similar to composite elements recently described in gibbon centromeres25, where absence of higher order satellite arrays accompanied the evolution of new composite elements with putative centromere function. The composition of the koala centromere therefore supports mounting evidence that transposable elements represent a major, functional component of small centromeres when higher order satellite arrays are absent20,24,25.
Interspersed repeats account for approximately 47.5% of the koala genome; 44% of these are transposable elements (Supplementary Table 11). As in other mammalian genomes, short interspersed nuclear elements (SINEs) and long interspersed nuclear elements (LINEs) are the most numerous elements (35.2% and 28.9% of total number of elements, respectively), with LINEs making up 32.1% of the koala genome. The long-read sequence assembly also enabled full characterization and annotation of repeat-rich long noncoding RNAs, including RSX, which mediates X chromosome inactivation in female marsupials26. Koala RSX represents the first marsupial RSX to be fully annotated and to have its structure predicted (Supplementary Fig. 4 and Supplementary Note). As expected, it was expressed in all female tissues, but in no male tissues27.
The assembled koala genome has very high coverage of coding regions: we recovered 95.1% of 4,104 mammalian benchmarking universal single-copy orthologs (BUSCOs)28, the highest value for any published marsupial genome (Supplementary Table 5) and comparable with that of the human assembly (GRCh38, which scores 94.1% of orthologs). Analysis of gene family evolution using a maximum-likelihood framework identified 6,124 protein-coding genes in 2,118 gene families with at least two members in koala. Among these, 1,089 have more gene members in koala than in any of the other species (human, mouse, dog, tammar wallaby, Tasmanian devil, gray short-tailed opossum, platypus, chicken; Supplementary Fig. 5).
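The final comparison above, counting gene families in which koala has more members than any other species, can be illustrated with a simple tally. The sketch below covers only that counting step and not the maximum-likelihood analysis of gene family evolution; the function name, the family identifiers and the per-species counts are hypothetical.

```python
def families_largest_in_focal(counts, focal="koala"):
    """Return families in which the focal species has at least two members
    and strictly more members than every other species in the table."""
    expanded = []
    for family, per_species in counts.items():
        focal_n = per_species.get(focal, 0)
        others = [n for species, n in per_species.items() if species != focal]
        if focal_n >= 2 and all(focal_n > n for n in others):
            expanded.append(family)
    return sorted(expanded)


# Hypothetical gene counts per family and species (illustration only).
toy_counts = {
    "FAM_A": {"koala": 5, "human": 2, "mouse": 3},  # koala strictly largest
    "FAM_B": {"koala": 2, "human": 2, "mouse": 1},  # tie with human: excluded
    "FAM_C": {"koala": 1, "human": 0, "mouse": 0},  # fewer than two members
}
largest = families_largest_in_focal(toy_counts)
```

Ties are excluded here because the text specifies families with more members in koala than in any other species.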
Having characterized the genome, we undertook detailed analyses of key genes and gene families to gain insights into the genomic basis of the koala’s highly specialized biology. Gene families of particular interest were those that encode proteins involved in induced ovulation, those proteins involved in the complex lactation process, those proteins responsible for immunity, and those enzymes that enable the koala to subsist on a toxic diet.
Ability to tolerate a highly toxic diet
The koala’s diet of eucalyptus leaves contains high levels of plant secondary metabolites29, phenolic compounds30 and terpenes (for example, ref. 31) that would be lethal to most other mammals32. Koalas thus experience little competition for food resources. Eucalyptus grandis shows substantial expansion in terpene synthase genes relative to other plant genomes33. Eucalypt toxicity is therefore likely to have exerted selection pressure on the koala’s ability to metabolize such xenobiotics, so we searched for genes encoding enzymes with a detoxification function and investigated sequence evolution at these loci.
Cytochrome P450 monooxygenase (CYP) genes represent a multi-gene superfamily of heme-thiolate enzymes that play a role in detoxification through phase 1 oxidative metabolism of a range of compounds including xenobiotics34. These genes have been identified throughout the tree of life, including in plants, animals, fungi, bacteria and viruses35. In the koala genome we found two lineage-specific monophyletic expansions of the cytochrome P450 family 2 subfamily C (CYP2Cs, 31 members in koala) (Fig. 1a). The functional importance of these CYP2C genes was further demonstrated through analysis of expression in 15 koala transcriptomes from two koalas, showing particularly high expression in the liver, consistent with a role in detoxification (Supplementary Fig. 6).
Comparing CYP2C gene context in mouse versus koala identified conserved flanking markers strongly suggestive of tandem duplication (Fig. 1b). Further sequence-level analysis of the CYP expansions indicated that most conserved regions are under strong purifying selection (Fig. 1c). However, there is evidence that individual CYP codons have experienced episodic diversifying selection while purifying selection shapes the rest of the gene (Fig. 1c–h, Supplementary Note and Supplementary Tables 12 and 13). Adaptive expansion of CYP2C and maintenance of duplicates appear to have worked in concert, resulting in higher enzyme levels for detoxification while the interplay between purifying and diversifying selection resulted in neofunctionalization within the CYPs. Such adaptations enable koalas to detoxify their highly specialized diet rich in plant secondary metabolites.
The characterization of koala CYP2Cs has significant therapeutic potential. The high expression levels of CYP2C genes in the liver help to explain why meloxicam, a nonsteroidal anti-inflammatory drug (NSAID) known to be metabolized by the protein product of CYP2C in humans36,37 and frequently used for pain relief in veterinary care, is so rapidly metabolized in the koala and a handful of other eucalypt-eating marsupials (common brushtail possum and eastern ringtail possum) compared with eutherian species37,38. It is expected that other NSAIDs are also rapidly metabolized in koalas and have little efficacy at suggested doses39. Anti-chlamydia antibiotics such as chloramphenicol are degraded rapidly by koalas; treatment with a single dose applicable to humans is insufficient in koalas, which require a daily dose for up to 30 to 45 d. This discovery of CYP2C gene expression levels will inform new research into the pharmacokinetics of medicines in koalas.
Taste, smell and food choice
Like many specialist folivores, koalas are notoriously selective feeders, making food choices both to target nutrients and to avoid plant secondary metabolites40. Koalas have been observed to sniff leaves before tasting them41, and their acute discrimination has been correlated with the complexity and concentration of plant secondary metabolites42. This suggests an important role for olfaction and vomerolfaction, as well as taste. While most herbivores circumvent plant chemical defenses by detoxifying one or a few compounds43, the complexity of eucalyptus plant secondary metabolites, in combination with the terpene expansion in eucalypts, led us to hypothesize that the koala requires enhanced capabilities both in specialist detection and in plant secondary metabolite detoxification. We therefore investigated the genomic basis of the koala’s taste and smell senses, finding multiple gene family expansions that could enhance its ability to make food choices.
We report an expansion of one lineage of vomeronasal receptor type 1 (V1R) genes associated with the detection of nonvolatile odorants (Supplementary Note). There are six such genes in koala, compared with only one in the Tasmanian devil and gray short-tailed opossum, and none found in tammar wallaby, human, mouse, dog, platypus or chicken. The expansion of one lineage of V1R genes is consistent with the koala’s ability to discriminate among diverse plant secondary metabolites.
Surprisingly, given the degree of its dietary specialization, the olfactory receptor genes (n = 1,169) characterized in koala had a gene repertoire that was slightly smaller than that of gray short-tailed opossum (1,431 genes), tammar wallaby (1,660 genes) and Tasmanian devil (1,279 genes) (Supplementary Note). This may be understood in the context of relaxed selection on olfactory receptors among dietary specialists44.
We also report genomic evidence of expansions within the taste receptor families that would enable the koala to optimize ingestion of leaves with a higher moisture and nutrient content in concert with the concentration of toxic plant secondary metabolites in their food plants. The koala’s ability to ‘taste water’ is potentially enhanced by an apparent functional duplication of the aquaporin 5 gene45,46,47 (Supplementary Table 14 and Supplementary Note).
The TAS2R family has a role in ‘bitter’ taste, enabling recognition of structural toxins such as terpenes, phenols and glycosides. These are found in various levels in eucalypts as plant secondary metabolites3,30,31,48. In marsupials, the TAS2R family includes the orthologous repertoires from eutherians, as well as three specific expansions in the last common ancestor shared by all marsupials49,50 (Fig. 2). Large koala-specific duplications in four marsupial orthologous groups have produced a large koala TAS2R repertoire of 24 genes (Fig. 2). The koala has more TAS2Rs than any other Australian marsupial, and among the most of all mammal species49,50, including paralogs of human and mouse receptors whose agonists are toxic glycosides (Supplementary Table 15 and Supplementary Note). The TAS1R gene families, responsible for sweet taste and umami amino acid perception, have previously been reported as pseudogenized in eutherians with highly specialized diets, such as the giant panda51. In the koala, however, we found that all TAS1R genes are putatively functional (Supplementary Fig. 7).
Genomics of an induced ovulator
Koala reproduction is of particular interest because the koala is an induced ovulator52. The genes analyzed include key genes controlling female ovulation (LHB, FSHB, ERR1, ERR2), as well as prostaglandin synthesis genes important in parturition and ejaculation (PTGS1, PTGS2, PTGS3) (Supplementary Note). We identified genes putatively involved in the induction of ovulation in the female by male seminal plasma (NGF), and in coagulation of seminal fluid (ODC1, SAT1, SAT2, SMOX, SRM, SMS) (Supplementary Note), which may function to prevent sperm leakage from the female reproductive tract in this arboreal species.
Genomic characterization of koala milk
A koala young is about the size of a kidney bean and weighs < 0.5 g. It crawls into the mother’s posteriorly opening pouch and attaches to a teat, where it remains for 6–7 months. It continues to suck after it has left the pouch until about a year old.
Analysis of the genome, in conjunction with a mammary transcriptome and a milk proteome, enabled us to characterize the main components of koala milk (Supplementary Fig. 8, Supplementary Table 16, Supplementary Note and ref. 53). The high-quality assembly of the genome allowed both the identification of marsupial-specific genes and determination of their evolutionary origins based on their genomic locations. For instance, we found that there are four Late Lactation Protein (LLP) genes tightly linked to both trichosurin and β-lactoglobulin (Supplementary Fig. 8), potentially allowing marsupials to fine-tune milk protein composition across the stages of lactation to meet the changing needs of their young. Additionally, the koala marsupial milk 1 (MM1) gene, a novel marsupial gene, is located close to the gene encoding very early lactation protein (VELP), an ortholog of Glycam1 (or PP3) that encodes a eutherian antimicrobial protein53 (Supplementary Fig. 8). In eutherians, this region contains an array of short glycoproteins that have antimicrobial properties and are found in secretions such as milk, tears and sweat. We propose that MM1 has an antimicrobial role in marsupial milk, along with three other short novel genes located in the same region. We also detected expansions in another antimicrobial gene family, the cathelicidins.
Koala immunome and disease
At the time of European settlement, koalas were widespread in eastern mainland Australia, from north Queensland to the southeastern corner of South Australia. Today they are mainly confined to the east coast and are listed as ‘vulnerable’ under Australia’s Environment Protection and Biodiversity Conservation Act 199954. There is strong evidence to suggest that some fragmented populations of koalas are already facing extinction, particularly in formerly densely populated koala territories in southeast Queensland and northern New South Wales. A major challenge for the conservation of these declining koala populations is the high prevalence of disease, especially that caused by the obligate intracellular bacterial pathogen Chlamydia pecorum, which is found across the geographic range, with the exception of some offshore islands55. A main challenge for managing these populations has been the lack of knowledge about the koala immune response to disease. Recent modeling suggests the best way to stabilize heavily affected koala populations is to target disease56.
The long-read-based genome enabled the de novo assembly of complex, highly duplicated immune gene families and comprehensive annotation of immune gene clusters53,57,58. These include the major histocompatibility complex (MHC)59, as well as T cell receptors (TCR), immunoglobulin (IG) (Supplementary Fig. 9, Supplementary Tables 17 and 18, and Supplementary Note), natural killer cell (NK) receptor58 and defensin60 gene clusters. Together these findings provide a starting point for new disease research and allow us to interrogate the immune response to the most significant pathogen of the koala, C. pecorum.
Of the more than 1,000 koalas arriving annually at wildlife hospitals in Queensland and New South Wales, 40% have late-stage chlamydial disease and cannot be rehabilitated. Annotation of koala immune genes enabled us to study variation within candidate genes known to play a role in resistance and susceptibility to chlamydia infection in other species (Supplementary Tables 18–20). Preliminary case/control association tests for five koalas involved in a chlamydia vaccination trial showed that the MHCII DMA and DMB genes, as well as the CD8-a gene, may be involved in differential immune responses to chlamydia vaccine (Supplementary Table 21 and Supplementary Note). We also conducted differential expression analysis of RNA sequencing (RNA-seq) data from conjunctival tissue collected from koalas at necropsy, both with and without signs of ocular chlamydiosis, showing that in diseased animals, 1,508 of the 26,558 annotated genes (5.7%) were upregulated at least twofold, while 685 (2.6%) were downregulated by greater than twofold when compared with healthy animals (Supplementary Fig. 9 and Supplementary Note). In diseased animals, upregulated genes were associated with Gene Ontology (GO) terms for a range of immunological processes, including signatures of leukocyte infiltration (Supplementary Fig. 9). Immune responses in the affected conjunctivas were directed at TH1 rather than TH2 responses. Proinflammatory mediators such as CCL20, IL1α, IL1β, IL6 and SSA1 were also upregulated. As in human trachoma, this cascade of proinflammatory products may help to clear the infection but may also lead to tissue damage in the host61. Furthermore, resolution of human trachoma infection is thought to require an IFN-γ-driven TH1 response62, and in diseased koalas we found that IFN-γ was upregulated 4.7-fold in the conjunctival tissue.
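The twofold thresholds and the percentages reported above can be illustrated with a small calculation. The sketch below classifies a gene by the ratio of its mean expression in diseased versus healthy tissue and recomputes the reported proportions from the gene counts given in the text. It is a simplification: the pseudocount, the function names and the example expression values are assumptions, and no significance testing of the kind performed by dedicated differential expression packages is included.

```python
import math


def classify_fold_change(diseased_mean, healthy_mean, threshold=2.0, pseudocount=1.0):
    """Label a gene 'up', 'down' or 'unchanged' from mean expression values.

    A pseudocount (an assumption of this sketch) avoids division by zero
    for genes with no reads in one condition.
    """
    ratio = (diseased_mean + pseudocount) / (healthy_mean + pseudocount)
    log2_fc = math.log2(ratio)
    cutoff = math.log2(threshold)
    if log2_fc >= cutoff:
        return "up"
    if log2_fc <= -cutoff:
        return "down"
    return "unchanged"


def percent_of_genes(n_genes, n_total):
    """Percentage of annotated genes, rounded to one decimal place."""
    return round(100 * n_genes / n_total, 1)


# Gene counts taken from the text above.
pct_up = percent_of_genes(1508, 26558)
pct_down = percent_of_genes(685, 26558)
```

With the counts from the text, `pct_up` and `pct_down` evaluate to 5.7 and 2.6, matching the reported percentages.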
These annotated koala immune genes will now help us to define features of protective versus pathogenic immunological responses to the disease and may be invaluable for effective vaccine design.
Koala genomes are undergoing genomic invasion by koala retrovirus (KoRV)63, which is spreading from the north of the country to the south. Both endogenous (germline transmission) and exogenous (infectious ‘horizontal’ transmission) forms are extant64. Our results provide a comprehensive view of KoRV insertions in the koala genome. We found a total of 73 insertions in the phaCin_unsw_v4.1 assembly (Supplementary Table 22). It is likely that most of these 73 loci are endogenous, consistent with our observation of integration breakpoint sequences that are shared with one or both of the other koala genomes reported (Supplementary Tables 23 and 24).
We investigated the sites of KoRV insertion to define their proximity to protein-coding genes and explore possible disruptions. This analysis identified insertions into 24 protein-coding genes (Supplementary Table 25). However, none is likely to disrupt protein-coding capacity, since 22 insertions are in introns and the other two are in 3′ untranslated regions. Transcription proceeding from the proviral long terminal repeat (LTR) could possibly affect the transcription of the host genes.
Understanding the genetics of host resistance to chlamydia and the etiology of the retrovirus will help inform the development of vaccines against both diseases, as well as translocation strategies.
Broad-scale population management of koalas is critical to conservation efforts. This is challenging because distribution models are not easily generalized across bioregions, and further complicated by the unique regional conservation issues described above. Since it is not possible to generalize management, it is imperative that decisions are informed by empirical data relevant to each bioregion.
Analysis of the koala genome provided the unique opportunity to combine historical evolutionary data with high-resolution contemporary population genomic markers to address these management challenges. To infer the ancient demographic history of the species, we analyzed the long-read reference genome and short-read data from two other koalas, using the pairwise sequentially Markovian coalescent (PSMC) method65 (Fig. 3a, Supplementary Fig. 10 and Methods). The data show that the modern koala, which appeared in the fossil record 350,000 years ago2, underwent an initial increase in population, followed by a rapid and widespread decrease in population size ~30,000–40,000 years ago. This is consistent with fossil evidence of rapid declines in multiple Australian species, including the extinct megafauna, 40,000–50,000 years ago66 and 30,000–40,000 years ago67. The koala was thus one of a number of species affected by decline during this time that did not ultimately become extinct67.
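PSMC reports time and population size in scaled units, so dating a decline in years requires a mutation rate and a generation time. The sketch below shows the conversion under the usual PSMC convention, in which the scaled mutation rate per bin is θ0 = 4·N0·μ·s (s being the bin size in base pairs), time is measured in units of 2·N0 generations and population size in units of N0. The function name is hypothetical, and the mutation rate, generation time and θ0 used in the example are illustrative assumptions, not the values used in this study.

```python
def scale_psmc(t_k, lambda_k, theta0, mu, generation_years, bin_size=100):
    """Convert scaled PSMC output to years before present and effective size.

    t_k              scaled time boundary (units of 2*N0 generations)
    lambda_k         scaled population size (units of N0)
    theta0           scaled mutation rate per bin estimated by PSMC
    mu               per-site, per-generation mutation rate (assumed)
    generation_years generation time in years (assumed)
    bin_size         number of base pairs per PSMC bin
    """
    n0 = theta0 / (4.0 * mu * bin_size)
    years = 2.0 * n0 * t_k * generation_years
    effective_size = n0 * lambda_k
    return years, effective_size


# Illustrative values only (hypothetical, not from the koala analysis).
years_ago, ne = scale_psmc(t_k=0.5, lambda_k=2.0, theta0=0.04,
                           mu=1e-8, generation_years=7.0)
```

With these illustrative inputs N0 is 10,000, giving a time of about 70,000 years and an effective size of about 20,000. Because both axes scale with the assumed mutation rate and generation time, inferred dates should be read with that uncertainty in mind.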
Distinct PSMC profiles of the koalas from two geographic areas and their failure to coalesce suggest some regional differences in koala populations, including impediments to gene flow (Fig. 3a). Regional differentiation was also detected in analyses of mtDNA68,69, although over a shorter time scale.
We analyzed populations of recent koala samples using 1,200 SNPs derived from targeted capture libraries mapped to the koala genome (Supplementary Note). We found notable levels of genetic diversity with limited fine-scale differentiation consistent with long-term connectivity across regions. We found evidence of low genetic diversity in southern koalas, consistent with a recent history of sequential translocations8,68,70,71 (Fig. 3b,c). At a continental scale, we show biogeographic barriers to gene flow associated with the Brisbane Valley and Clarence River, as identified by mtDNA studies68,72, and find a barrier associated with the Hunter Valley, which was not previously known in koalas (Fig. 3b). Levels of inbreeding varied across regions (Fig. 3c), but the northern populations most under threat in New South Wales and Queensland show high levels of genetic diversity.
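Genetic diversity and inbreeding can be summarized from SNP genotypes in several ways. As an illustration only, the sketch below computes a simple population-level inbreeding coefficient, F = 1 − Ho/He, from biallelic genotypes coded as alternate-allele counts; this is a generic textbook estimator and not necessarily the one applied to the koala data, and the function name and toy genotypes are hypothetical.

```python
def population_f(loci):
    """Estimate F = 1 - Ho/He across biallelic SNP loci.

    `loci` is a list of loci; each locus is a list of genotypes coded as
    the number of alternate alleles (0, 1 or 2), with None for missing calls.
    Ho is the observed fraction of heterozygotes and He = 2p(1-p) is the
    expectation under random mating; both are summed across loci.
    """
    ho_sum = 0.0
    he_sum = 0.0
    for calls in loci:
        called = [g for g in calls if g is not None]
        if not called:
            continue  # skip loci with no genotype calls
        p = sum(called) / (2 * len(called))  # alternate allele frequency
        he_sum += 2 * p * (1 - p)
        ho_sum += sum(1 for g in called if g == 1) / len(called)
    if he_sum == 0:
        raise ValueError("no polymorphic loci")
    return 1 - ho_sum / he_sum


# Toy genotypes for four individuals at two loci (hypothetical).
toy_loci = [
    [0, 1, 1, 2],  # p = 0.5, Ho = 0.5, He = 0.5
    [0, 0, 2, 2],  # p = 0.5, Ho = 0.0, He = 0.5
]
toy_f = population_f(toy_loci)
```

In the toy data the second locus has no heterozygotes despite intermediate allele frequency, so the estimate is F = 0.5, the kind of heterozygote deficit expected in an inbred population.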
The information generated here provides a foundation for a conservation management strategy to maintain gene flow regionally while incorporating the genetic legacy of biogeographic barriers. Furthermore, the contrast in genome-wide levels of diversity between southern and northern populations highlights the detrimental consequences, for genome-wide levels of genetic diversity, of the unmonitored use of small isolated populations as founders for reestablishing and/or rescuing populations. Low levels of genetic diversity in southern koalas have been associated with genetic abnormalities consistent with inbreeding depression, including testicular abnormalities73.
Now that we understand the consequences of past translocations, and the existing genetic structure, it is clear that maintaining and facilitating gene flow via habitat connectivity will be the most effective means of ensuring genetically healthy koala populations over the long term. However, where more intensive measures such as translocation are required to rescue genetically depauperate southern populations, these tools and data provide the basis for decisions that maximize benefits while minimizing risks74,75. Future utility of these SNPs will also include tracking of individual pedigrees in captive koala populations and in those wild populations being intensively monitored.
The koala genome offers insights into historic and contemporary population dynamics, providing evolutionary and genetic context for a species that is the focus of considerable management actions and resources. By providing a deeper understanding of disease dynamics and population genetic processes, including the maintenance and monitoring of gene flow, this genomic information will enable the development of strategies necessary to preserve the species, from the preservation of habitat corridors through to the genetic rescue of isolated populations. As members of government advisory committees, some of the authors have initiated inclusion of genomic information into the New South Wales Koala Strategy. This will be used to inform koala management in the state with the goal of securing koalas in the wild for the future.
The koala genome provides the highest quality marsupial genome to date. This assembly has enabled insights into the colonization of the koala genome by an exogenous retrovirus and revealed the architecture of the immune system, necessary to study and treat emerging diseases that threaten koala populations. A greater understanding of genetic diversity across the species will guide the selection of individuals from genetically healthy northern populations to augment genetically restricted populations in the south, bearing in mind that chlamydia has not been detected on some offshore islands, so risk assessment should be carried out before embarking on translocations. Sequencing the genome has advanced our understanding of the unique biology of the koala, including detoxification pathways and innovations in taste and smell to enable food choices in an obligate folivore. Long-term survival of the species depends on understanding the impacts of disease and management of genetic diversity, as well as the koala’s ability to source moisture and select suitable foraging trees. This is particularly important given the koala’s narrow food range, which makes it especially vulnerable to a changing climate. The genome provides a springboard for conservation of this biologically unique and iconic Australian species.
FALCON assembly algorithm, https://github.com/PacificBiosciences/FALCON-integrate/; FALCON (v 0.3.0), http://falconframework.org/; RepeatMasker (v 4.0.3), http://www.repeatmasker.org/; RepeatModeler, http://www.repeatmasker.org/RepeatModeler/; RepBase (v 2015-08-07), http://www.girinst.org/repbase/; MAKER, http://www.yandell-lab.org/software/maker.html; Trinity (v 2.3.2), https://github.com/trinityrnaseq/trinityrnaseq/; SNAP, http://archive.broadinstitute.org/mpg/snap/; GeneMark, http://opal.biology.gatech.edu/GeneMark/; Augustus, http://bioinf.uni-greifswald.de/augustus/; NCBI Blast (v 2.3.0), https://blast.ncbi.nlm.nih.gov/Blast.cgi; OrthoMCL (v 2.0.9), http://orthomcl.org/orthomcl/; MAFFT (v 7.2.71), https://mafft.cbrc.jp/alignment/software/; TreeBeST (v 1.9.2), http://treesoft.sourceforge.net/treebest.shtml; HyPhy, https://veg.github.io/hyphy-site/; Datamonkey, http://www.datamonkey.org/; STAR, http://star.mit.edu/genetics/; featureCounts, http://bioinf.wehi.edu.au/featureCounts/; DESeq2, https://bioconductor.org/packages/release/bioc/html/DESeq2.html; SARTools, https://github.com/PF2-pasteur-fr/SARTools/; Dotter, https://sonnhammer.sbc.su.se/Dotter.html; GATK (v 3.3-0-g37228af), https://software.broadinstitute.org/gatk/; KAT comp, https://github.com/TGAC/KAT/; BUSCO (v 2), http://busco.ezlab.org/; Trimmomatic (v 0.36 PE), http://www.usadellab.org/cms/?page=trimmomatic; Bowtie2 (v 2.2.4), http://bowtie-bio.sourceforge.net/bowtie2/index.shtml; MACS2 (v 22.214.171.12431216), https://github.com/taoliu/MACS/; R (v 3.2.5), https://www.r-project.org/; gplots (v 3.0.1), https://cran.r-project.org/web/packages/gplots/index.html; bedtools (v 2.25.0), http://bedtools.readthedocs.io/en/latest/; kSamples (v 1.2-4), https://cran.r-project.org/web/packages/kSamples/index.html; ggbiplot (v 0.55), https://github.com/vqv/ggbiplot/; Tandem Repeats Finder, https://tandem.bu.edu/trf/trf.html; seqLogo, https://bioconductor.org/packages/release/bioc/html/seqLogo.html; RNAfold, http://rna.tbi.univie.ac.at/cgi-bin/RNAWebSuite/RNAfold.cgi; UniProt/Swiss-Prot, http://www.uniprot.org/; dammit!, https://dammit.readthedocs.io/en/refactor-1.0/; Transfuse, https://github.com/cboursnell/transfuse/; GMAP, http://research-pub.gene.com/gmap/; Trim Galore!, https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/; Kallisto, https://pachterlab.github.io/kallisto/; Sleuth, https://pachterlab.github.io/sleuth_walkthroughs/trapnell/analysis.html; All-vs-all BLASTP (v 2.2.30+), https://blast.ncbi.nlm.nih.gov/Blast.cgi; MUSCLE (v 3.8.31), https://www.drive5.com/muscle/; HMMER suite (v 3.1b1 May 2013), http://hmmer.org/; FASTASEARCH (v 36.8.8), https://www.ebi.ac.uk/Tools/sss/fasta/; Integrative Genomics Viewer (IGV) (v 2.3.97), https://github.com/ssadedin/IGV-CRAM/; MEGA (v 7.0.18), https://www.megasoftware.net/; RAxML (v 8.2.11), https://sco.h-its.org/exelixis/web/software/raxml/index.html; Burrows-Wheeler aligner (v 0.7.15), http://bio-bwa.sourceforge.net/; Samtools (v 1.3), http://www.htslib.org/; Geneious (v 10.2.3), https://www.geneious.com/; Coancestry, https://www.zsl.org/science/software/coancestry/; PLINK (v 1.07), http://zzz.bwh.harvard.edu/plink/.
A full description of the Methods can be found in the Supplementary Note. No statistical methods were used to predetermine sample size.
Genome sequencing and assembly of the koala reference genome
Samples were obtained as part of veterinary care at the Port Macquarie Koala Hospital and Australia Zoo Wildlife Hospital, and from the Australian Museum Tissue Collection. Sample collection was performed in accordance with methods approved by the Australian Museum Animal Ethics Committee (permit numbers 11–03 and 15–05). “Pacific Chocolate” (Australian Museum registration M.45022), a female from Port Macquarie in northeast New South Wales, was sampled immediately after euthanasia by veterinary staff at the Port Macquarie Koala Hospital (27 June 2012), following unsuccessful treatment of severe chlamydiosis. Two koalas from southeast Queensland—a female, “Bilbo” (Australian Museum registration M.47724), from Upper Brookfield, and a male, “Birke”, from Birkdale—were sampled following euthanasia due to severe chlamydiosis (20 August 2015) and severe injuries (26 August 2012), respectively. High molecular weight (HMW) DNA was extracted from heart tissue for Pacific Chocolate and kidney tissue for Birke using the DNeasy Blood and Tissue kit (Qiagen), with RNaseA (Qiagen) treatment. HMW DNA from Bilbo was extracted for PacBio sequencing from spleen tissue using Genomic-Tip 100/G columns (Qiagen), DNA Buffer set (Qiagen) and RNaseA (Qiagen) treatment. Fifteen SMRTbell libraries were prepared (RCG) as per the PacBio 20-kb template preparation protocol, with an additional damage repair step performed after size selection. A minimum size cutoff of 15 or 20 kb was used in the size selection stage using the Sage Science BluePippin system. The libraries were sequenced on the Pacific Biosciences RS II platform (Pacific Biosciences) employing P6 C4 chemistry with either 240 min or 360 min movie lengths. A total of 272 SMRT Cells were sequenced to give an estimated overall coverage of 57.3 × based on a genome size of 3.5 Gbp. A TruSeq DNA PCR free library was constructed with a mean library insert size of 450 bp. 
A total of 400,473,997 paired-end reads were generated, yielding a minimum coverage of 34×. HMW gDNA was sequenced on an Illumina HiSeq X Ten sequencing run (150 bp paired-end; Illumina).
An overlap-layout-consensus assembly algorithm, FALCON (v 0.3.0) (see URLs), was used to generate the draft genome from PacBio reads. Total genome coverage before assembly was estimated as the total number of bases in the reads divided by the 3.5 Gbp genome size, giving an estimated total coverage of 57.3×. FALCON leverages error-corrected long seed reads to generate an overlap-layout-consensus representation of the genome. Approximately 23× of long reads are required by FALCON as seed reads, and the rest are used for error correction. The seed read length, at the 60th percentile of reads, was calculated as 10,889 bp. The FALCON assembly was run in the Amazon Web Services Tokyo region using r3.8xlarge spot instances as compute nodes, with the number of instances varying from 12 to 20 depending on availability.
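The coverage arithmetic described above can be illustrated with a minimal Python sketch. It is not FALCON's own code: the function names are hypothetical, and the rule used here for choosing a seed length (the shortest read among the longest reads that together supply a requested fold coverage) is an assumption made for illustration only.

```python
def estimated_coverage(read_lengths, genome_size_bp):
    """Total sequenced bases divided by the assumed genome size."""
    return sum(read_lengths) / genome_size_bp


def seed_length_cutoff(read_lengths, genome_size_bp, seed_coverage):
    """Length of the shortest read in the set of longest reads that
    together reach `seed_coverage`-fold coverage (illustrative rule)."""
    target_bases = seed_coverage * genome_size_bp
    running = 0
    # Walk from the longest read downwards, accumulating bases.
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running >= target_bases:
            return length
    return None  # the reads cannot supply the requested seed coverage
```

With real data, `read_lengths` would hold one entry per PacBio subread and `genome_size_bp` would be the 3.5 Gbp estimate used in the text.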
After filtering low-quality and duplicate reads, approximately 57.3-fold long-read coverage was used for assembly. The primary contigs from the FALCON v 0.3.0 assembly (representing homozygous regions of the genome) yielded genome version phaCin_unsw_v4.1. This comprised 3.19 Gb, including 1,906 contigs with an N50 of 11.6 Mb and sizes ranging up to 40.6 Mb. The heterozygous regions of the genome (representing the alternative contigs from the assembly) were a total of 230 Mb, with an N50 of 48.8 kb (Supplementary Table 2). Approximately 30-fold coverage of Illumina short reads was used to polish the assembly with Pilon86.
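For readers unfamiliar with the N50 statistic quoted above, the following Python sketch shows the standard definition: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name is hypothetical and the sketch is not taken from the assembly pipeline.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(contig_lengths)
    running = 0
    # Accumulate contig lengths from longest to shortest.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")
```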
BUSCO analysis on the draft assembly was run against the mammalian ortholog database with the --long parameter on all genomes under comparison. This initial analysis showed that the assembly reached only about 60% genome completeness, suggesting a high number of indels in the draft genome. The genome polishing tool Pilon86 was employed to improve the draft assembly from FALCON. About 30× of 150 bp paired-end Illumina X Ten short reads from Bilbo were used as input for this polishing process, which was run on a compute cluster provided by Intersect Australia Limited.
We implemented the method of Deakin et al.18 for super-scaffolding. Briefly, tables of homologous genes were generated using the physical order of genes on the chromosomes of gray short-tailed opossum and tammar wallaby as references and koala phaCin_unsw_v4.1 (Bilbo) as target (Supplementary Table 4).
Analysis of centromeric regions and repeat structure
Repeat content was called using RepeatMasker with combined RepBase libraries (v 2015-08-07) and RepeatModeler calls generated from the genome assemblies. The resulting calls were then filtered using custom Python scripts to remove short fragments (see “Code availability”) and combine tandem or overlapping repeat calls. To characterize the centromeric regions of the genome, chromatin immunoprecipitation (ChIP) was performed using the Invitrogen MAGnify Chromatin Immunoprecipitation System (Revision 6). Repeat content of the centromeric regions was determined using RepBase annotated marsupial repeats and output from RepeatModeler analysis of koala. RepeatMasker was used to locate repeats. Candidate centromeric segments were identified using two sliding window analyses, with window sizes of 200 kb and 20 kb and step sizes of 100 kb and 10 kb, respectively. Small tandem repeats were discovered in koala RSX sequence using the Tandem Repeats Finder program87, using +2, –3, and –7 as scores for match, mismatch and gap opening, respectively. Alignments of consensus repeat units with the RSX sequence were processed to obtain nucleotide frequency at each position.
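The sliding window layout described above (for example, 200 kb windows advanced in 100 kb steps) can be sketched in Python as follows. The per-window statistic is not specified in this section; the fraction of bases covered by repeat calls is used here purely as an assumed example, and both function names are hypothetical.

```python
def sliding_windows(seq_len, window, step):
    """Yield half-open (start, end) windows; the last ones are truncated."""
    start = 0
    while start < seq_len:
        yield start, min(start + window, seq_len)
        start += step


def covered_fraction(intervals, start, end):
    """Fraction of [start, end) covered by non-overlapping half-open
    intervals (an assumed per-window statistic, for illustration)."""
    covered = sum(max(0, min(e, end) - max(s, start)) for s, e in intervals)
    return covered / (end - start)
```

A call such as `sliding_windows(contig_length, 200_000, 100_000)` would reproduce the coarser of the two window layouts, and `sliding_windows(contig_length, 20_000, 10_000)` the finer one.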
Genome annotation and gene family analysis
Annotations were generated using the automated genome annotation pipeline MAKER88,89. We masked repeats in the assembly by providing MAKER with a koala-specific repeat library generated with RepeatModeler90, against which RepeatMasker (v 4.0.3)91 queried genomic contigs. Gene annotations were made using a protein database combining the UniProt/Swiss-Prot92 protein database, all sequences for human (Homo sapiens), gray short-tailed opossum (Monodelphis domestica), Tasmanian devil (Sarcophilus harrisii) and tammar wallaby (Notamacropus eugenii) from the NCBI protein database93, and a curated set of marsupial and monotreme immune genes94. We downloaded all published koala mRNAseq reads from SRA (PRJNA230900, PRJNA327021) and reassembled de novo male, female and mammary transcriptomes using the default parameters of Trinity v 2.3.2 (ref. 95). Each assembly was filtered such that contigs accounting for 90% of mapped reads were passed to MAKER as homologous transcript evidence. Ab initio gene predictions were made using the programs SNAP96, GeneMark97 and Augustus98. Three iterative runs of MAKER were used to produce the final gene set.
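One way to read the transcript filter described above is sketched below in Python: contigs are ranked by the number of reads mapped to them, and the top-ranked contigs are kept until they account for the requested fraction of all mapped reads. The ranking rule and the function name are assumptions for illustration; the text does not give the exact procedure.

```python
def contigs_covering(read_counts, fraction=0.9):
    """Return the most highly covered contigs that together account for
    at least `fraction` of all mapped reads (assumed ranking rule)."""
    total = sum(read_counts.values())
    kept, running = [], 0
    ranked = sorted(read_counts.items(), key=lambda kv: kv[1], reverse=True)
    for name, count in ranked:
        if running >= fraction * total:
            break  # the requested share of reads is already covered
        kept.append(name)
        running += count
    return kept
```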
Gene families were called using NCBI Blast (2.3.0) and OrthoMCL (2.0.9)99. The protein sequences of genes belonging to orthogroups identified by OrthoMCL were aligned using MAFFT (7.2.71)100 and the gene tree was inferred using TreeBeST (1.9.2)101, providing a species tree to guide the phylogenetic reconstruction. Custom scripts (see “Code availability”) were applied to identify families with expansion within the koala, Diprotodontia, Australidelphia and marsupial lineages.
Analysis of sequence evolution was conducted on the cytochrome P450 (CYP), vomeronasal receptor (V1R), olfactory receptor (OR), aquaporin and taste receptor gene families (Supplementary Note). Genes involved in koala development, reproduction and lactation were also characterized (Supplementary Note). Koala MHC, TCR and IGG genes were annotated and analyzed for expression between diseased and healthy animals (Supplementary Note). Evidence of selection across CYP and V1R genes was evaluated (Supplementary Note) using multispecies alignments (N = 152 and 8 sequences, respectively) in HyPhy102, hosted by the Datamonkey webserver103.
RNA-seq analysis of koala conjunctival tissue samples
Conjunctival tissue samples were collected from 26 koalas euthanized due to injury or disease by veterinarians at Australia Zoo Wildlife Hospital, Currumbin Wildlife Hospital and Moggill Koala Hospital. The collection protocol was approved by the University of the Sunshine Coast Animal Ethics Committee (AN/S/15/36). Health assessments of the eye were performed by an experienced veterinarian and classified as either ‘healthy’ (N = 13) or ‘diseased’ (N = 13) based on evidence of gross pathology consistent with ocular chlamydiosis55. Conjunctival tissue samples from each animal were placed directly in RNALater (Qiagen, Germany) buffer overnight at 4 °C before storing at –80 °C for later use. RNA was extracted using an RNeasy Mini Kit (Qiagen, Germany) according to the manufacturer’s instructions, with an on-column DNase treatment to eliminate contaminating DNA from the sample. The concentration and quality of the isolated RNA were determined using a NanoDrop ND-1000 Spectrophotometer and Agilent BioAnalyzer (Agilent, USA). Library construction and sequencing were performed by the Ramaciotti Centre (UNSW, Kensington, NSW) with TruSeq stranded mRNA chemistry on a NextSeq500 (Illumina, USA). Reads were mapped to the phaCin_unsw_v4.1 assembly using the default parameters of STAR104 and counts summed over features using featureCounts105. Differentially expressed genes were called using DESeq2 (ref. 106) as implemented in the SARTools package107.
Koala retrovirus (KoRV)
We searched for KoRV sequences within the scaffolds of the phaCin_unsw_v4.1 assembly of the Bilbo genome sequence, and also within alternative contig sequences before their correction by Pilon (since we noticed that in a few cases KoRV sequences were removed in the course of the sequence polishing process). KoRV sequences were found by using the program blastn108 to search with KoRV genome reference sequences (GenBank AF151794 and AB721500). Search results were converted to BED format and the KoRV and recKoRV components of each read were merged with the program mergeBed. KoRV insertions within genes were identified using the program intersectBed109. Pre-integration allelic sequences were found by using blastn108 to search the phaCin_unsw_v4.1 genome sequence assembly with sequences flanking KoRV/recKoRV integrations as queries. In two cases the expected allelic sequence was not present in the Bilbo genome, but was found by searching the genome of another koala (Pacific Chocolate). To check the expected relationship between pairs of allelic sequences, we inspected dot plot alignments of representative sequences (not shown) created with the program dotter110.
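The interval merging step performed here with mergeBed can be illustrated conceptually in Python. The sketch below merges overlapping or directly adjacent half-open BED intervals on the same sequence; it is a stand-in for the idea, not a reimplementation of bedtools, and the function name is hypothetical.

```python
def merge_intervals(records):
    """Merge overlapping or book-ended (chrom, start, end) BED intervals."""
    merged = []
    # Sorting groups records by sequence name, then by start coordinate.
    for chrom, start, end in sorted(records):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2]:
            last[2] = max(last[2], end)  # extend the current merged block
        else:
            merged.append([chrom, start, end])
    return [tuple(block) for block in merged]
```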
Koala population genomics: historical population size
Demographic history was inferred from the diploid sequence of each of the three koalas, using a pairwise sequential Markovian coalescent (PSMC) method65. We conducted a range of preliminary analyses and found that PSMC plots were not sensitive to the values chosen for the maximum number of iterations (N), the number of free atomic time intervals (p), the maximum time to the most recent common ancestor (t), and the initial value of ρ. Based on these investigations, our final PSMC analyses of the three genome sequences used values of N = 25, t = 5, ρ = 1 and p = 4 + 25 × 2 + 4 + 6. The number of atomic time intervals is similar to that recommended for analyses of modern human genomes65, which are similar in size to the koala genomes. We determined the variance in estimates of Ne using 100 bootstrap replicates. Replicate analyses in which we varied the values of p, t and ρ produced PSMC plots that were broadly similar to those using our chosen ‘optimal’ settings (Supplementary Fig. 10).
The plots of demographic history were scaled using a generation length of 7 years, corresponding to the midpoint of the range of 6 to 8 years estimated for the koala111. In the absence of a mutation rate estimate for koala, the midpoints of the estimates of the human mutation rate (1.45 × 10^−8 mutations per site per generation; summarized by ref. 112) and the mouse mutation rate (5.4 × 10^−9 mutations per site per generation113) were applied (Supplementary Fig. 10). The koala mutation rate is likely to be closer to that of humans, based on greater similarity in genome size, life history, and effective population size, relative to mouse112.
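The effect of the chosen generation length and mutation rate on the scaled plots can be seen in the following Python sketch of the usual PSMC rescaling. It assumes the standard PSMC conventions (time in units of 2N0 generations, population size in units of N0, and θ0 estimated per bin of consecutive sites); the bin size of 100 bp is the PSMC default and is an assumption here, as are the function name and the example values.

```python
def scale_psmc(theta0, t_k, lambda_k, mu, generation_years, bin_size=100):
    """Convert PSMC output to years and effective population size.

    theta0     : scaled mutation rate per bin reported by PSMC
    t_k        : interval boundary in units of 2*N0 generations
    lambda_k   : relative population size in that interval
    mu         : mutation rate per site per generation
    bin_size   : sites per bin (100 assumed, the PSMC default)
    """
    n0 = theta0 / (4 * mu * bin_size)
    years = 2 * n0 * t_k * generation_years
    ne = n0 * lambda_k
    return years, ne
```

Because N0 is inversely proportional to the mutation rate, applying the lower mouse rate instead of the human rate shifts the same PSMC curve to older times and larger population sizes.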
Koala population genomics: contemporary population analysis
Forty-nine koalas were sampled throughout the distribution using a hierarchical approach to allow examination of genetic relationships at a range of scales, from familial to range-wide. All individuals were sequenced using a target capture approach described in ref. 114, with a kit targeting 2,167 marsupial exon sequences. Illumina sequence reads were quality-filtered and trimmed (see ref. 114 for details) and mapped to the koala genome (Bowtie2 v 2.2.4; ref. 115). A panel of 4,257 SNP sites was identified (using GATK version 3.3-0-g37228af; ref. 116) that showed expected levels of relatedness and differentiation among the sampled individuals. A panel of 1,200 SNPs (obtained by mapping to targets, filtering, and selecting one SNP per target) showed fine-scale regional differentiation consistent with evolutionary history and recent population management (Fig. 3).
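The step of selecting one SNP per target can be sketched in Python as below. The criterion used to choose among several SNPs on the same target is not stated in this section; keeping the SNP with the highest quality score is an assumption made for illustration, and the function name and record layout are hypothetical.

```python
def one_snp_per_target(snps):
    """Keep a single SNP per capture target.

    `snps` is an iterable of (target_id, position, quality) tuples.
    The highest-quality SNP per target is kept (assumed criterion);
    ties keep the first SNP encountered.
    """
    best = {}
    for target, position, quality in snps:
        if target not in best or quality > best[target][1]:
            best[target] = (position, quality)
    return {target: position for target, (position, _) in best.items()}
```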
Statistics and reproducibility
In Fig. 1e, points shown indicate the mean empirical Bayes factor (EBF) for sites under selection; error bars, 95% confidence interval. In Fig. 1f–h, 95% confidence intervals are calculated as 1.96 × s.e.m. (sample size is sequence depth, as indicated by red bars in Fig. 1c).
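As a worked illustration of the interval used for Fig. 1f–h, the Python sketch below computes a mean with a 95% confidence interval of 1.96 × s.e.m. The use of the sample standard deviation (n − 1 denominator) and the function name are assumptions; the text specifies only the 1.96 × s.e.m. rule.

```python
import math
import statistics


def mean_ci95(values):
    """Return (mean, lower, upper) using mean +/- 1.96 * s.e.m."""
    mean = statistics.mean(values)
    # s.e.m. = sample standard deviation divided by sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(len(values))
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width
```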
In Fig. 3c, center lines indicate median and box limits indicate upper and lower quartiles. Upper whisker = min(max(x), Q_3 + 1.5 × IQR), lower whisker = max(min(x), Q_1 – 1.5 × IQR); i.e., upper whisker = upper quartile + 1.5 × box length, lower whisker = lower quartile – 1.5 × box length. Circles indicate outliers. Linear modeling indicated that mean F differed significantly between several regions (Midcoast New South Wales–Southern Australia, P = 0.000524; Queensland–Southern New South Wales, P = 0.00237; Queensland–Southern Australia, P = 0.00000107; Southeast Queensland–Southern Australia, P = 0.006596).
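The whisker definitions given for Fig. 3c can be written out as a short Python sketch. The quartile convention is not stated in the text; linear interpolation between order statistics (the `inclusive` method of `statistics.quantiles`) is assumed here, and the function name is hypothetical.

```python
import statistics


def box_summary(x):
    """Median, quartiles, whiskers and outliers as defined for Fig. 3c."""
    q1, median, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    upper = min(max(x), q3 + 1.5 * iqr)  # upper whisker
    lower = max(min(x), q1 - 1.5 * iqr)  # lower whisker
    outliers = [v for v in x if v < lower or v > upper]
    return {"q1": q1, "median": median, "q3": q3,
            "lower": lower, "upper": upper, "outliers": outliers}
```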
Further information on experimental design is available in the Nature Research Reporting Summary linked to this article.
(1) Custom scripts to identify gene families with expansion within the koala, Diprotodontia, Australidelphia and marsupial lineages; (2) custom scripts to identify refined repeat calls; and (3) code used to generate SNP genotypes from exon capture data are available at https://github.com/DrRebeccaJ/KoalaGenome.
The Phascolarctos cinereus BioSamples are as follows: Bilbo 61053, SAMN06198159; Pacific Chocolate, SAMEA91939168; Birke, SAMEA103910665. Koala Genome Consortium Projects for the Koala Whole Genome Shotgun project and genome assembly are registered under the umbrella BioProject PRJEB19389 (union of PRJEB5196 and PRJNA359763).
Transcriptome data are submitted under PRJNA230900 (adrenal, brain, heart, lung, kidney, uterus, liver and spleen) and PRJNA327021 (milk and mammary gland). Illumina short-read data for Birke is submitted under PRJEB19982.
The Bilbo 61053 assembly described in this paper is version MSTS01000000 and consists of sequences MSTS01000001–MSTS01001906. For the Bilbo assembly Illumina X Ten reads are submitted under PRJEB19457 and PacBio reads under PRJEB19889.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
R.N.J. and the Australian Museum acknowledge the Australian Museum Foundation, Bioplatforms Australia, New South Wales Environmental Trust grant 2014/RD/0015, Australian Research Council LIEF Grant LE160100154, the University of Sydney HPC service and Amazon Web Services for support; and C. Staples from Featherdale Wildlife Park, C. Flanagan from Port Macquarie Koala Hospital, J. Hangar, E. Hynes, J. Reed, S. Ingleby, A. Divljan and S. Ginn for assistance with sample acquisition. K.B. acknowledges support from the Australian Research Council and Bioplatforms Australia. M.R.W. and the Ramaciotti Centre for Genomics acknowledge support from the Australian Research Council, from the Australian Government NCRIS scheme via Bioplatforms Australia, the New South Wales State Government RAAP scheme and the University of New South Wales. W.H. and W.J.N. were supported by strategic BBSRC funding (Institute Strategic Programme Grant BB/J004669/1) and by the NBI Computing Infrastructure for Science (CiS) group. A.D.G., K.M.H and K.T. were supported by grant R01GM092706 from the National Institute of General Medical Sciences (NIGMS) and A.D.G. had additional support from Morris Animal Foundation grant D14ZO-94. T.N.H., Z.D. and R.J.O. were supported by awards from the National Science Foundation 1613806 and the facilities within the Center for Genome Innovation at the University of Connecticut. C.E.H. thanks CSIRO National Research Collections Australia funding. K.B. and A.P. thank the veterinary staff at Australia Zoo Wildlife Hospital, Currumbin Wildlife Hospital and Moggill Koala Hospital for their assistance in the collection of samples for the koala conjunctival transcriptome study. T.H. acknowledges the Kyoto University Research Administration Office (KURA) for support and was financed by JSPS KAKENHI grant number 16K18630 and the Sasakawa Scientific Research Grant from the Japan Science Society. A.P. and P.T. 
acknowledge financial support from the Australian Research Council and A.G. financial support via Australian Research Council Discovery Grant DP110104377. C.M.W. is supported by a University of Sydney research fellowship from the estate of Mabs Melville. All authors thank Bioplatforms Australia and Pacific Biosciences. The authors thank T. Haydon for valuable editorial input; S. Potter for expert technical assistance; and R. Gleadow, C. Frere, D. Lunney and D. Alvarez-Ponce for valuable discussions on content.