Genome analysis of three Pneumocystis species reveals adaptation mechanisms to life exclusively in mammalian hosts

Pneumocystis jirovecii is a major cause of life-threatening pneumonia in immunosuppressed patients including transplant recipients and those with HIV/AIDS, yet surprisingly little is known about the biology of this fungal pathogen. Here we report near complete genome assemblies for three Pneumocystis species that infect humans, rats and mice. Pneumocystis genomes are highly compact relative to other fungi, with substantial reductions of ribosomal RNA genes, transporters, transcription factors and many metabolic pathways, but contain expansions of surface proteins, especially a unique and complex surface glycoprotein superfamily, as well as proteases and RNA processing proteins. Unexpectedly, the key fungal cell wall components chitin and outer chain N-mannans are absent, based on genome content and experimental validation. Our findings suggest that Pneumocystis has developed unique mechanisms of adaptation to life exclusively in mammalian hosts, including dependence on the lungs for gas and nutrients and highly efficient strategies to escape both host innate and acquired immune defenses.


Supplementary Figure 2. Validation of P. murina genome assembly by CHEF and Southern blotting.
(a and b) show consecutive hybridizations with two different blots. Lane EB, agarose gel stained by ethidium bromide. Probes from single-copy genes specific for each scaffold are indicated by a 4-digit number above each lane, representing the last four digits of the gene ID (for example, 0359 for gene PNEG_00359). The number at the bottom in each panel represents the scaffold number. When using these probes, only a single band was detected for each probe, with a size expected for the corresponding scaffold. When the scaffolds were ordered by length, they were all consistent with the order of chromosomes by length. The hybridization bands for two sets of scaffolds (3 and 4, and 16 and 17) were not well separated but could be distinguished by overlapping developed X-ray films. Probe 9-11-16 contains a DNA fragment shared among three scaffolds (nos. 9, 11 and 16 boxed in red). Probe msg contains the putative conserved recombination junction element (CRJE), which is highly conserved among classical msg-A1 members but absent in other msg members. Signals of hybridization with the msg probe were detected on all chromosomes except for chromosomes (or scaffolds) 6 and 12, which are the only two scaffolds without msg-A1 sequences. In contrast to a previous CHEF study 1 in which the kexin gene was located on chromosome 6 which did not hybridize with an msg probe, we found it on scaffold/chromosome 5 (by sequencing as well as CHEF hybridization using probe 0129), which also showed strong hybridization with the msg probe. These observations suggest a possible misalignment of the DNA band in the previous report, or alternatively P. murina strain variation. Probe STR contains a subtelomeric repeat sequence from the 5' end of scaffold 9 as shown in Supplementary Fig. 3a.  (d) Schematic representation of three P. murina scaffolds (6, 7 and 15) and three P. carinii scaffolds (1, 7 and 17). Homologous regions between P. murina and P. carinii scaffolds are indicated in identical colors, and are connected by the grey shading. All rearrangements were confirmed by hybridization (data not shown). (e) An example of a genomic region showing gene duplication or deletion and chromosomal rearrangement among three Pneumocystis species. A tandem 6-gene cluster (shown in purple arrows) is identified in the P. murina chromosome 17 (PmChr17) and belongs to the msg-C family (Fig. 3). Only one partial copy of this family is present in P. carinii in scaffold 16 (PcS3.16) in the homologous region while two orthologs in P. jirovecii are located in two different scaffolds (not shown). The regions flanking the 6-gene cluster of P. murina are conserved in all three species, except that P. jirovecii (PjS2.14) has an inversion (from genes T551_03089 to T551_3102) relative to the right side of the P. murina tandem cluster, as indicated by the dashed grey line.
Supplementary Figure 5 Determination of Pneumocystis rDNA copy number.
(a) Illumina sequence read depth for the rDNA locus in each Pneumocystis species. The Illumina reads used for this analysis are available from the NCBI Sequence Read Archive with accession SRR770457 for P. murina B123, SRR1043727 for P. carinii B80 and SRR1043749 for P. jirovecii RU7. The read depth was determined by the number of total aligned reads multiplied by the read length (101 base) divided by the length of the rDNA locus or whole genome. While there was S. cerevisiae rDNA sequence contamination in P. murina and P.
jirovecii, only reads matching these two species at high identity were used for this calculation.   Phylogenetic relationship is inferred from 413 single copy core orthologs. Ortholog conservation patterns highlighted include: core orthologs found in all genomes ('Core'); Ascomycete specific found in all Ascomycetes but not in the Microsporidia ('Ascomycete'); Pneumocystis specific; 'Missing', found in all other genomes but absent in one genome; 'Shared', present in any two or more genomes; and 'Unique', found in only one genome. The scale at the bottom indicates the number of orthologous genes.

Expansion of the M16 and kexin peptidase families in Pneumocystis.
(a) Phylogenetic tree of M16 peptidases built using all genes with hits to Pfam domains PF05193 and PF00675, showing that most of the ortholog groups (indicated by the colored vertical bars) are conserved in most of the analyzed species, and most species have a single copy in each ortholog cluster, except as noted below. Pneumocystis species have expansions of the ortholog group (at the bottom) that includes S. cerevisiae genes Ste23 (YLR389C) and Axl1 (YPR122W). P. jirovecii has two copies, P. murina has five copies and P. carinii has six copies.
The function of each gene family is based on the S. cerevisiae genes (red triangles) and  The tree includes all Pneumocystis genes containing a Pfam PF05730 domain. Each Pneumocystis species (blue squares for P. murina, pink circles for P. carinii and green diamonds for P. jirovecii) has 5 members of the CFEM domain-containing gene family (two to three were not included in Fig. 2 due to a higher homology cut-off level used in Fig. 2); each gene has one to six CFEM domains. Other pathogenic species also have expansions of this family, including A. fumigatus (4 copies with "Afu" as the prefix in the gene ID), C. immitis (8 copies with "CIMG" as the prefix), and C. albicans (6 copies with "CAL" or "CAF" as the prefix). S. cerevisiae has only one gene (YLR390W-A) and Schizosaccharomyces species do not have any. GPI anchor was predicted using the PredGPI (http://gpcr.biocomp.unibo.it/predgpi/pred.htm), big-PI (http://mendel.imp.ac.at/gpi/fungi_server.html) and KohGPI (http://gpi.unibe.ch/) programs, with + and -indicating the presence or absence of a GPI anchor for each method, respectively. Signal peptide was predicted using the SignalP 4.1 server 2 . Transmembrane domains were predicted using the TMpred server (http://www.ch.embnet.org/software/ TMPRED_form.html); the number of transmembrane domains is indicated. In C. albicans, CFEM proteins are involved in scavenging iron from heme and hemoglobin 3,4 .

Supplementary Note 1: Confirmation of duplicated regions and telomere repeats.
To verify three sets of duplicated gene cassettes in the end of 7 scaffolds in P. murina ( Supplementary Fig. 1a), we performed contour clamped homogeneous electrical field (CHEF) electrophoresis and hybridization with probes from a region shared within each duplicated set.
Two or three bands were detected for each probe, with a size expected for the corresponding scaffolds. An example of these hybridization experiments is shown in Supplementary Fig. 2a.
These experiments confirmed that copies of these gene cassettes are present on multiple scaffolds. CHEF electrophoresis and hybridization were also used to verify the telomere/subtelomere repeats in P. murina. When using an oligonucleotide probe containing 9 copies of the telomere repeat unit TTAGGG, hybridization signal was detected on all visualized chromosomes ( Supplementary Fig. 3b). When using a DNA probe amplified from P. murina genomic DNA, which contained a mixture of three different repeat units TTATGC, TTAGGC and TTAGGG ( Supplementary Fig. 3a), hybridization occurred predominantly with 4 chromosomes corresponding to scaffolds 6, 9, 12 and 17 ( Supplementary Fig. 2b) When using the P. murina telomere/subtelomere repeats to blast Illumina raw reads from P. carinii and P. jirovecii, a large number of reads was found for the repeat unit TTAGGG (linked to Pneumocystis-specific sequences) while no reads were found for other repeat units (TTAGGC, TTATGC, and GTAGGC) in either species. These data support TTAGGG as the telomere repeat unit in all three Pneumocystis species.
We were unable to perform CHEF studies with P. carinii or P. jirovecii since CHEF analysis requires fresh, heavily infected lung tissues, which are currently unavailable to us.

Supplementary Note 2: Confirmation of chromosomal rearrangements.
We found rearrangements of ~60-260 kb segments in five P. carinii scaffolds compared to the P. murina assembly ( Fig. 1a; Supplementary Fig. 4). All rearrangements were first analyzed by PCR using primers covering the rearranged regions. Sequencing of the PCR products showed sequences perfectly consistent with the scaffold assemblies. In addition, we performed Southern blotting analysis using CHEF-separated P. murina chromosomes as described above and restriction fragments of P. carinii genomic DNA as described in our previous report 16 . Representative hybridization results are shown in Supplementary Fig. 4b, c.
All rearrangements were supported by hybridization results.
By contrast, the P. jirovecii assembly showed more extensive rearrangements, both  Fig 2b). Furthermore, by PCR targeting a 9-11 kb fragment covering the entire rDNA and a part of its upstream and downstream protein-coding genes in each Pneumocystis species, we amplified a single band with the expected size and terminal sequence ( Supplementary Fig. 5b). In addition to the rDNA locus, each Pneumocystis has two copies of 5S rRNA genes present as a tandem repeat (separated by an intergenic region) in a different chromosome than the rDNA locus.
Comparative analysis indicates Pneumocystis have the smallest number of rRNA genes among fungi, similar to Taphrina deformans, which is a slow growing fungal pathogen in plants 9 and which also contains only a single rRNA operon 9 and two copies of 5S rRNA genes (identified from the sequences deposited in NCBI database). In contrast, other eukaryotes, including the intracellular Microsporidia 15 , as well as many bacteria, contain multiple copies (up to tens of thousands) of rDNA per genome 20,21 . For example, S. cerevisiae and S. pombe contain a total of ~560 and ~450 rRNA genes, respectively, based on an estimated 140 copies of the rRNA gene cassette present in the genomes of S. cerevisiae 12,22,23 and S. pombe 10 ; each cassette includes 4 genes for S. cerevisiae and 3 genes for S. pombe. The total for S. pombe also includes 5S rRNA genes that are located outside these cassettes 10 (Table 1).
In each Pneumocystis species, a total of 45-47 tRNA genes were identified by tRNAScan-SE 24 , similar to that in Microsporidia 15 but much less than that in other fungi and eukaryotes (170-570 copies) 25 .
It is generally assumed that rDNA redundancy allows the cell to maintain a functional ribosome in diverse environments and that a higher rDNA copy number allows for an increased rate of rRNA synthesis, thus a higher level of ribosome production and more rapid growth 26 .
The reduction to a single rDNA copy in Pneumocystis may reflect its genome stability as a result of adaptation to a stable environment 27,28 . The single rDNA copy together with a minimal number of tRNA genes as well as an extremely low GC content in Pneumocystis also suggests a slow transcription and translation machinery 21,26,29 , which presumably reflects a response to resource availability (e.g. a large amount of energy and resources used for intron processing,   Targeted sequencing of msg-containing subtelomeric regions in different P. murina and P. carinii isolates showed no sequence variation between isolates, however different msg sequences were found immediately downstream of and in frame with the single copy UCS sequence required for expression 1,44 . These observations support a gene conversion mechanism for msg recombination, and no inter-strain variation of the msg repertoire in laboratory strains of P. murina and P. carinii, as has been suggested from previous studies 37 . In contrast, the P. jirovecii msg repertoire shows extensive inter-strain variation 37 , and subtelomeres are regions of high diversity between strains (Fig. 1b). RNA-Seq data indicate that all msg genes in P. murina were noted previously 8,68,70,71 and in our initial analysis, we were extra cautious in defining a gene loss. All gene losses in the current report were verified by repeated blast analysis of the annotated protein sets, assembled nucleotide sequences and raw reads we generated, as well as the previously reported sequences for P. carinii 7 and P. jirovecii 8 . Of note, none of the genes reported as lost in the current study were identified as being present in the previously reported assemblies 7,8 .
We found that Pneumocystis genomes encode all the enzymes required for GPI anchor synthesis except for one, GPI-protein-acyl-transferase (Cwh43), which is highly conserved within Ascomycota, and in S. cerevisiae 72 is responsible for replacing the diacylglycerol moiety of the nascent GPI-anchor moiety with ceramide (Supplementary Data 19). These findings suggest that Pneumocystis produces GPI anchored proteins which do not contain ceramide as has been found in some GPI-anchored proteins in S. cerevisiae 73 .
The gene encoding transaldolase of the pentose phosphate pathway (PPP), which is absent in the previous P. carinii genome assembly 69 , is also absent in the current P. carinii and P. murina genomes but present in P. jirovecii as well as all other ascomycetes analyzed. PPP is an important source of NADPH for anabolic reactions, and of sugar molecules required for the biosynthesis of nucleic acids and amino acids. While it is possible that the transaldolase gene in rodent Pneumocystis could not be identified based on sequence homology due to a low degree of sequence conservation 74 , this gene is also missing in Plasmodium falciparum 75 and Cyanidioschyzon merolae 76 . It is unclear if the transaldolase activities in rodent Pneumocystis could be replaced by other enzymes 77 .
All three Pneumocystis species lack de novo synthesis pathways and direct transport systems for phosphatidylinositol, phosphatidylcholine, inositol and choline ( Fig. 4; Supplementary Data 12). Nevertheless, these metabolites could be supplied by alternative mechanisms. Each Pneumocystis genome encodes the Dnf1-Lem3 flippase complex involved in uptake of external lyso-phosphatidylcholine that can be converted to phosphatidylcholine and subsequently to choline. We also identified a potential transporter (Git1) involved in uptake of external glycerophosphoinositol that could be hydrolyzed into inositol 78 , though the enzyme responsible for this hydrolysis has not been definitively identified in Pneumocystis or other fungi. Of note, the downstream pathways to make phosphatidylinositol and various inositol phosphate compounds are fully conserved in each Pneumocystis species.
A recent report identified inositol transporters in Pneumocystis 5 , including genes identical to PNEG_01850 and PNEG_00943 in P. murina, T552_02044 and T552_00744 in P.
carinii, and T551_02057 in P. jirovecii in this study. Based on our phylogenetic analysis of these genes as well as another closely related but previously unidentified gene (T551_02439 in P. jirovecii) 5 , these genes are more closely related to glucose transporters than inositol transporters ( Supplementary Fig. 9a).

It is possible that inositol transporters have been lost in
Pneumocystis. Inositol is a critical compound needed to synthesize phosphatidylinositol and various inositol phosphates. Pneumocystis may rely on an alternative mechanism to obtain inositol as shown in Fig. 4 and discussed above.
As previously reported 68  is often a contaminant, though it was seen in both P. carinii digests.

Supplementary Note 7: Msg protein glycosylation identification.
Genome analysis suggests that, unlike other fungi, Pneumocystis cell wall proteins are not highly mannosylated ( Fig. 7; Supplementary Data 21). Since it is not practical to isolate Pneumocystis cell wall proteins free of host proteins due to the lack of a reproducible culture method, and Msg is the most abundant surface protein and the only surface protein that has been well characterized in Pneumocystis, we chose to examine glycosylation in P. carinii Msg proteins that were affinity purified using an anti-Msg monoclonal antibody.
By liquid chromatography-tandem mass spectrometry (LC-MS/MS) analysis, both the released N-linked profiling and glycopeptide mapping data suggest that M5N2 is a predominant N-glycan component on P. carinii Msg (Fig. 7c, d; Supplementary Fig. 13; Supplementary Data 21). The data suggest that M6-M9N2 are also present but the relative abundance of those species are very low compared to M5N2 (Supplementary Fig. 13). Nothing greater than M9N2 was definitively identified. The abundance of those minor species were slightly above-the-noise level in the released N-linked profiling, approaching the limit of detection of mass spectrometry.

Identification of msg and kexin genes in subtelomeric regions.
In initial genome assemblies for all three Pneumocystis species, only partial msg genes were present in most scaffold ends, except for the msg clusters present in six P. carinii scaffolds and one P. jirovecii scaffold which were obtained by merging with the previously reported telomere ends 36,83 . Although a large number of raw reads mapped to msg genes, they were under-represented in the scaffolds due to high-level sequence identity between different msg genes (80-99%), which made accurate assembly difficult.
Identification of msg genes by anchored PCR : Based on known msg sequences (msg-A1 subfamily, Fig. 3) we designed primers from highly conserved regions, including the putative conserved recombination junction element (CRJE) 1 at the 5' end and a region at the very 3' end.
These primers were paired with primers specific for the ends of the targeted scaffolds. All PCR products were subjected to Sanger sequencing; the resulting sequences were added to the targeted scaffolds. This approach extended partial msg genes, and identified new msg genes at the ends of most scaffolds in P. murina and P. carinii, including three sets of tandem arrays of 2 or 3 genes (5-10 kb with 95-99% identity), with each set duplicated in 2 or 3 scaffolds in P.
Identification of msg and kexin genes by PacBio sequencing : Following a preliminary study of the ability of PacBio sequencing to provide long reads (>1.5 kb) with high accuracy (~99%) for msg genes 84 , we utilized PacBio sequencing to identify classical msg genes (msg-A1 subfamily, Fig. 3) in all three Pneumocystis species, msr genes (msg-A2 subfamily, Fig. 3 Gene prediction. The P. murina and P. carinii genomes were annotated using a combination of expression data (RNA-Seq), homology information, and ab initio gene finding methods as previously described 85 . RNA-Seq reads were aligned to the genome using BWA 86  These methods were also used to annotate the RU assembly of P. jirovecii from the patient, with the exception that RNA-Seq data was not used. Initial gene sets for the three species were compared to evaluate the consistency of orthologs across the three Pneumocystis species, using OrthoMCL 95 to assign orthologs across the three gene sets. Where gene calls appeared to be missing in one or two genomes, we examined all evidence from the individual gene prediction algorithms and RNA-Seq alignments to identify candidate gene models, and where possible added the best model to the gene set. In addition, we examined cases where genes appeared to be missing due to split or merged gene calls; these were corrected where possible. This comparative process resulted in a highly consistent and complete gene set for the three genomes.
Probable mobile elements were removed from the gene set using multiple methods.
Predicted genes in conserved repetitive elements were identified using TPSI (http://transposonpsi.sourceforge.net), presence of Pfam domains known to occur in repetitive elements, and a BLAST run against a locally maintained repeat library. Additional repeats were identified using a BLAT alignment of the draft gene set to the genome, requiring 90% nucleotide identity over at least 100 bases; genes matching the genome more than seven times were removed as probable repeats. To ensure this process was not too aggressive, no genes were removed if they included non-repeat Pfam domains.
The final gene set was produced after checking the repeat-filtered updated gene set for inframe stops, CDS overlaps, multiple Ns in exons, exons ≥ 1500 nt and introns ≤ 20 nt, and corrected manually as needed. The gene sets in P. carinii and P. jirovecii include all Pneumocystis-specific predicted CDSs in previously reported assemblies (Supplementary Table   1 To infer the phylogenetic relationship of Pneumocystis and related fungi, we identified 413 single copy core orthologs using OrthoMCL 95 in the 12 genomes in our comparative set (Supplementary Table 2). Proteins from each ortholog group were aligned with MUSCLE 39 and the resulting alignments concatenated. A phylogeny was then inferred using RAxML (v7.7.8) 40 with model PROTCATWAG and 1,000 bootstrap replicates.
To identify syntenic regions of conserved gene order between the Pneumocystis genomes, we identified all blocks of three or more genes using DAGchainer 99 . Orthologous genes were identified using OrthoMCL and input to DAGchainer to find syntenic regions. In addition to Pneumocystis, we also identified syntenic regions with other related fungi in Taphrinomycotina; the block size was set at three to detect such regions. Between P. murina and S. pombe, we identified 86 syntenic regions encompassing 295 total genes (3.4 genes per block on average). Between P. murina and T. deformans we detected 36 regions encompassing 121 genes (3.4 genes per block on average); however as the T. deformans assembly is highly fragmented, this is likely an under-estimate of syntenic conservation. Syntenic conservation was plotted using functions in R.
The same phylogenetic methods described above were used to analyze genes involved in metabolic pathways. All trees were constructed using protein sequences. Unless otherwise specified in the tree, the protein names from all species shown are used as follows. The proteins from S. cerevisiae were indicated by their systematic names alone (seven characters starting with Y, e.g. YFL040W in Supplementary Fig. 9a) or followed with standard abbreviated names (e.g. YGL012W: Erg4 in Supplementary Fig. 11b). For all other species, the proteins were indicated by their locus tags available from NCBI or respective genome databases (Supplementary Table 2). The first two to five characters of the locus tags are species specific,