Investigation of inter- and intraspecies variation through genome sequencing of Aspergillus section Nigri

Aspergillus section Nigri comprises filamentous fungi relevant to biomedicine, bioenergy, health, and biotechnology. To learn more about what genetically sets these species apart, as well as about potential applications in biotechnology and biomedicine, we sequenced 23 genomes de novo, forming a full genome compendium for the section (26 species), as well as 6 Aspergillus niger isolates. This allowed us to quantify both inter- and intraspecies genomic variation. We further predicted 17,903 carbohydrate-active enzymes and 2,717 secondary metabolite gene clusters, which we condensed into 455 distinct families corresponding to compound classes, 49% of which are only found in single species. We performed metabolomics and genetic engineering to correlate genotypes to phenotypes, as demonstrated for the metabolite aurasperone, and by heterologous transfer of citrate production to Aspergillus nidulans. Experimental and computational analyses showed that both secondary metabolism and regulation are key factors that are significant in the delineation of Aspergillus species. De novo assembly of 23 Aspergillus section Nigri and 6 Aspergillus niger genome sequences allows for inter- and intraspecies comparisons and prediction of secondary metabolite gene clusters.

S pecies in the genus Aspergillus are of broad interest to medical 1 , applied 2,3 , and basic research 4 . Members of Aspergillus section Nigri ('black aspergilli') are prolific producers of native and heterologous proteins 5,6 , organic acids (in particular citric acid 2,7,8 ), and secondary metabolites (including biopharmaceuticals and mycotoxins like ochratoxin A). Furthermore, the section members are generally very efficient producers of extracellular enzymes 9,10 ; they are the production organisms for 49 out of 260 industrial enzymes 11,12 . Among the most important of these, in addition to A. niger, are A. tubingensis, A. aculeatus, and A. luchuensis (previously A. acidus, A. kawachii, and A. awamori [13][14][15] , respectively).
Members of Aspergillus section Nigri are also known as destructive degraders of foods and feeds, and some isolates produce the potent mycotoxins ochratoxin A 16 and fumonisins [17][18][19] . In addition, some species in this section have been proposed to be pathogenic to humans and other animals 20 . It is thus of interest to further examine section Nigri for industrial exploitation, as well as prevention of food spoilage, toxin production, and pathogenicity caused by these fungi.
A combined phylogenetic and phenotypic approach has shown that section Nigri contains at least 27 species [21][22][23][24][25] . Recent results have shown that the section contains species with high diversity and may consist of two separate clades: the biseriate species and the uniseriate species 26 , which show differences in sexual states 27 , sclerotium formation 28 , and secondary metabolite production 29 . In the section, only six species have had their genome sequenced: A. niger 2,8 , A. luchuensis 15,30 , A. carbonarius 31 , A. aculeatus 31 , A. tubingensis 31 , and A. brasiliensis 31 .
This section, with its combination of species richness and fungal species with a diverse impact on humanity, is thus particularly interesting for studying the diversification of fungi into species. In this study, we have de novo-sequenced the genomes of 20 species of section Nigri, thus completing a genome compendium of 26 described species in the section. Further, we have genome-sequenced three

Investigation of inter-and intraspecies variation through genome sequencing of Aspergillus section Nigri
Articles NaTurE GENETics additional A. niger isolates (including two previously described as species A. lacticoffeatus 10 and A. phoenicis 32 ), which in combination with the other analyses allows for inter-and intraspecies comparison of 32 isolates. The development of algorithms for comparative genomics, combined with experimental analysis of the species, allows us to track genetic diversity across genomes, from the protein level, over the evolution of biosynthetic gene clusters, to the groups of genes that define clades or individual species. The high resolution in genome sequences allows us to characterize both species diversification and variation within species.

Results
New genomes show high genetic diversity of section Nigri. We present 23 whole-genome draft sequences: 20 genomes of section Nigri species previously unsequenced and 3 additional A. niger genomes for assessment of intraspecies diversity. All genomes were sequenced, assembled, and annotated using the Joint Genome Institute (JGI) fungal genome pipeline 33,34 (Supplementary Table 1; genomes were sequenced by either Illumina or Pacific Biosciences sequencing). Figure 1 shows a phylogenetic tree as well as gene richness, number of scaffolds, and functional annotation (InterPro 35,36 ). The tree supports previous proposals 10, 32 that A. lacticoffeatus and A. phoenicis are synonyms of A. niger.
In comparing key statistics of the genomes, we found that some traits are quite similar and others surprisingly variable. Many of the investigated species have around the average number of genes (11,900), but there is considerable variation from the smallest number of predicted genes (10,066) to the largest (13,687). The smallest number of predicted genes in section Nigri is found in A. saccharolyticus, which supports the previous observation 37, 38 that this species is quite atypical in section Nigri.
We further evaluated the annotation of the 23 genome sequences we generated. The percentage of complete genes (including a start and stop codon) is in the range of 94-98%, and 67% of the proteins could be assigned one or more InterPro domains. The number of scaffolds (average 166) varies from 47 in A. piperis to 518 in A. ellipticus. On average, 70% of the proteins had sequence homologs in Swiss-Prot (91% of proteins have homologs within section Nigri; see next section). This means that even though six members of section Nigri have already been sequenced, ~30% of the predicted gene models in each of the new genomes are not found to encode proteins with homologs in Swiss-Prot.
The pan-and core-genome shows genome flexibility. Given the genetic diversity in section Nigri, we were interested in examining the extent of genome diversification. For this analysis, we focused on three conceptual groups of genes: (1) The pan-genome: all genes present in one or more species.
(2) The core-genome: genes present in all included species,

NaTurE GENETics
including paralogs. This set is expected to encode cellular functions needed for all species.
(3) Species-unique genes: genes found in only one species in our analysis, with or without paralogs. Included in these, we would expect to find genes involved in environmental adaptation. This group can also include annotation errors.
We first identified orthologs and paralogs with a BLASTpbased pipeline using reciprocal hits according to cut-offs specifically selected here for the Aspergillus genus (Methods). Groups of homologous proteins are referred to as families. Figure 2a-c shows the overall genetic diversity between 38 fungal strains (32 species) from closely related genera (Fig. 2a), within the Aspergillus genus (36 of the 38 strains; Fig. 2b), and from section Nigri (32 of the 38 strains; Fig. 2c).
The Aspergillus genus pan-genome comprises 433,116 genes across the 36 Aspergillus genomes, and from this, 62,996 gene families were constructed. Of those families, 6% are found in all genomes (3,769 core families), while 9% are genes without orthologs in the other genomes (40,424 unique genes; 39,929 unique families) (Fig. 2b). We also found evidence of gene loss, duplication, and potential gene transfers between species of this section, as 23% of the pan-gene families are not present in groups of species fitting the phylogenetic tree (Supplementary Table 2). This is consistent with previous work reporting extensive horizontal gene transfer in Aspergillus 39 .
We further performed an analysis defining the number of coregene families in section Nigri and in all sub-clades thereof (Fig. 2d). The core-genome of section Nigri is 32% larger than that of the genus (4,983 families relative to 3,769; Fig. 2b,c). Conversely, 9% are unique to a specific species (32,378 unique genes in 32,036 families; Fig. 2c). The fraction of genes unique to a species is similar within the section and across the genus, meaning that adding a new section Nigri genome adds as many new genes as adding a more distantly related Aspergillus (within the analyzed group of species). This is rather interesting and shows a generally high genetic diversity of All fungal species (38) Aspergillus species (36) Aspergillus section Nigri ( diversity between fungal species from closely related genera, species within the Aspergillus genus, and 32 species within section Nigri. a-c, Histogram representation of the total number of genes and families distributed over the pan-genome, core-genome, and unique genome for the 38 fungal genomes of this study (a), the 36 Aspergillus genus genomes (b), and the 32 genomes in section Nigri (c). d, Dendrogram of the phylogenetic relation between the 32 species in section Nigri. The black nodes represent homolog families found only in the species branching from the nodes. The white boxes represent the genes unique to the specific species. e, Stacked histograms of the gene count, the core-genome genes, and unique genes for each species; values shown are total numbers of genes. Dark colors, InterPro annotation; light colors, no annotation.

NaTurE GENETics
genus Aspergillus. However, such a tendency could be also the result of overpredicting genes, considering the low rate of InterPro annotation in the unique genes (Fig. 2e).
The section Nigri core genome contains carbohydrate-active enzymes and secondary metabolism gene clusters. To associate biological functions to the pan-, core-, and unique genomes, and genes exclusive to only members of the black aspergilli, we employed the InterPro database 36 . An examination of the coregenome of 38 fungal genomes ( Fig. 2a and Supplementary Fig. 1) revealed that only 4.5% of the genes lack InterPro domains (Supplementary Table 3a), indicating-as would be expected-that the core-genes across closely related fungal genera include generally known and conserved functions. For the pan-genomes of the 36 Aspergillus species compared with the section Nigri species, the percentages of unknown function are similar (32% compared with 33%; Supplementary Tables 3d and 4a and Supplementary Fig. 2), as are the corresponding percentages for the core-genomes (14% compared with 17%, Fig. 2e; Supplementary Tables 3d and 4a). General functions like transporters, regulators, organelle-specific proteins, primary metabolism, and structural domains were found as core features across all 36 aspergilli (Supplementary Table 3f), which supports the general validity of the method.
We expected the section Nigri core-genome (gene families found in all the species of section Nigri but not in any other aspergilli examined) to contain Nigri signature genes, and we found this to be the case. These 1,214 gene families contain 580 InterPro domains conserved to a varying degree, including a many genes involved in the saprotrophic lifestyle and secondary metabolism (Supplementary Table 5). It is hypothesized that these genes are defining for the section compared with other aspergilli and will encode functions related to the phenotypes of species in this section.
Unique secondary metabolism genes in Aspergillus species. The genetic diversity seen in section Nigri led us to investigate whether the unique genes for each species show common trends in function. While these genes by definition do not have homologs in other species investigated in this work, we can predict general functions using InterPro domains. Unique genes of species in section Nigri matched 1,334 different InterPro domains (Supplementary Table 6a-c). Within the unique genes, we searched the list of InterPro domains in all sets of genes unique to individual section Nigri species (excluding the six A. niger isolates, to remove intraspecies redundancy). Surprisingly, we identified only ten domains that were found in nearly all Nigri species (25-26 species). Notably, nine of those are related to functions involved in secondary metabolism, gene regulation (transcription factors), or protein regulation (protein kinases) (Supplementary Table 7). Finding these functions in nearly all sets of species-specific genes suggests that secondary metabolite production and regulatory proteins are commonly identified as the species-'unique' genes and are therefore critical differentiates for fungal species at the genetic level.
Intra-and interspecies genetic variations are similar. We were interested in comparing the diversity between isolates of the same species to the diversity among species in the same clade. We thus compared six A. niger isolates to the eight closely related species in the A. tubingensis clade (Fig. 2d) Fructose-2,6-bisphosphatase 13 Fructose-bisphosphate aldolase 4 Triosephosphate isomerase 5 Glyceraldehyde-3-phosphate dehydrogenase 6 Phosphoglycerate kinase 7 Phosphoglycerate mutase 8 Phosphopyruvate hydratase (enolase) 9 Pyruvate kinase 10 Pyruvate carboxylase 11 Phosphoenolpyruvate carboxykinase (ATP) 12 b c Growth on CREA acid-indicator plates

Fig. 3 | Gene family sizes for central metabolism pathways from 32 section Nigri genomes and 6 reference species.
Proteins putatively catalyzing the same metabolic step (that is, complex subunits or functionally related enzymes) have been grouped in gray boxes. Species are sorted according to the phylogeny in Fig. 1. a, Glycolysis isoenzymes. b, Tricarboxylic acid cycle isoenzymes. c, Growth of strains on CREA medium, which turns yellow with lowered pH. A. niger CBS 513.88 has been mutagenized 2 .

NaTurE GENETics
degree of genetic homogeneity, as 80% of the A. niger pan-genome is conserved across the six isolates and only 6% is unique to any of the isolates ( Supplementary Fig. 3a). The same scale is seen in the A. tubingensis clade (77% shared pan-genome, 7% unique; Supplementary  Fig. 3b). Moreover, the percentage of genes with predicted functional domains within the two groups is similar to that within section Nigri as a whole (Supplementary Fig. 4  Extra citrate synthase genes confer increased citrate. As species of section Nigri are known organic-acid producers, the genes involved in central metabolism are of interest, particularly given that the cause of citric acid overproduction in several of the section members is still not identified 40 . We thus analyzed the number of paralogs in central carbon metabolism in our set of 38 fungal genomes using a curated version of an A. niger genome-scale metabolic model 41 as a source of pathway annotation (Fig. 3). The analysis of paralogs in glycolysis shows very little variance across the 32 Nigri genomes, similar to the variation in the 6 other fungal species genomes (Fig. 3a). For the tricarboxylic acid cycle (Fig. 3b), it is evident that certain metabolic steps in the pathway are conserved throughout all species, while others vary in paralog numbers. The biseriates are particularly homogeneous. These, along with four uniseriates, are also the primary citric acid-producing species in the section (Fig. 3c).
Of particular interest is the citrate production phenotype, and thus citrate synthase. All biseriates have one extra citrate synthase, and the four acid-producing uniseriates have two extra. Sequence alignment identified three distinct types ( Supplementary Fig. 5), two of which are mitochondrial and are found in all species. All extra citrate synthase paralogs are of the third type, predicted to be cytosolic. We identified the extra biseriate citrate synthase (citB 42 ) in a conserved 30 kB gene cluster including two transcription factor genes, a transporter gene, and two putative fatty acid synthase genes. We performed heterologous expression of the A. niger citB gene cluster in A. nidulans (which has only the two mitochondrial citrate synthases) using two constitutive promoters to control the transcription factors. This expression increased citrate concentrations by 42-52% (Supplementary Table 10). We hypothesize that this gene cluster may have a particular role in citrate production and additional undescribed functions involving the fatty acid synthaselike genes.
Carbon utilization is not correlated with carbohydrate-active enzyme content. Aspergilli have a particularly broad ability to degrade and convert plant biomass 31 . It is thus essential to examine the species diversity of this trait at the genotype and phenotype levels. We predicted the carbohydrate-active enzyme (CAZyme) gene content of the genomes across section Nigri (17,903 CAZyme domains; Fig. 4 and Supplementary Table 11) and performed growth profiling on plant biomass-related carbon sources ( Supplementary  Fig. 6). Growth on d-glucose was used to evaluate relative growth, showing variation between species.
In a previous study 10 , enzyme levels were measured in several black aspergilli, and significant differences were found. However, differences in enzyme levels do not reflect the copy number differences seen here (Supplementary Table 11). Considering the relative uniformity of the CAZyme content (Fig. 4), no correlation between genome content and growth on plant biomass-related carbon sources ( Supplementary Fig. 6) was observed for the black aspergilli, suggesting that the differences in capability for plant biomass degradation reflect gene expression levels in the individual fungus. This confirms a proteome study of less-related aspergilli, in which the different response to plant biomass appeared to be mainly at the regulatory level 43 . The data suggest that this is the case for section Nigri: species-specific phenotypes are driven not generally by CAZyme content in closely related species, but by regulation.

Secondary metabolism in section Nigri contains 455 families.
Secondary metabolism is thought to be a component of chemical defense, virulence, toxicity, mineral uptake, and communication in fungi 44 and has a wide range of potential medical applications. As we had identified it to be commonly unique to individual species, we examined the exometabolite diversity of 37 Aspergillus and Penicillium species according to predictions of secondary   Supplementary  Table 11. Growth profiles are available in Supplementary Fig. 6. All black aspergilli grew well on pectin and have a highly conserved and extensive set of genes encoding pectin-active enzymes. Growth on other plant polysaccharides such as xylan, starch, and guar gum was more variable, despite the presence of highly conserved genes related to xyloglucan and starch degradation. The growth and genetic variability on inulin are particularly high: nine species showed reduced growth. Moreover, endoinulinase (GH32 INU) is only present in eight of the black aspergilli, while the remainder of inulin-related genes (GH32 INV and INX) are more commonly present (Supplementary Table 11). However, the growth phenotypes show no correlation with the gene content ( Supplementary Fig. 6).

NaTurE GENETics
metabolism gene clusters (SMGCs) as well as chemical profiles of the species of section Nigri on multiple substrates. We identified 2,717 SMGCs in the 37 genomes. This is an even higher number of SMGCs per species than a previous study found in 24 Penicillium genomes 45 . We were further interested in quantifying the actual diversity of the SMGCs in section Nigri and in analyzing presence patterns of SMGCs across species. We therefore defined SMGC 'families' as genetically similar SMGCs across genomes (Methods). Each SMGC family is expected to produce the same or similar compounds. This clustering resulted in the definition of 455 SMGC families across the 37 genomes ( Supplementary  Fig. 7), indicating the potential production of 455 different chemical families. Most families (82%) are found in fewer than 10 organisms, and 49% contain only one gene cluster ( Supplementary Fig. 8 shows examples). On average there are 8.75 unique clusters per species, despite the close phylogenetic distance of the section.

Phylogenetic examination shows dynamic content of SMGCs.
To reveal more about how SMGCs evolve and differentiate between species, each of the 455 SMGC families was characterized by the type of backbone enzyme and analyzed according to the phylogeny (Fig. 5a,b). Only five out of all SMGCs were present in all analyzed species, including clusters for the non-ribosomal peptide synthetase (NRPS)-derived siderophore ferrichrome, the circular NRP fungisporin 46 /nidulanin A 47 , and pigment (YWA1) synthesis. Two shared SMGC families were false predictions, namely two fatty acid synthases.
Examining the dynamics of the families, only 32% and 19% of SMGCs found in two or three organisms, respectively, follow the whole-genome phylogeny and suggest intragenus horizontal gene transfer or SMGC loss to be relatively common. As an example, an SMGC is found in five distantly related species ( Supplementary  Fig. 8b). The cluster is found in all A. niger isolates as well as in A. homomorphus, A. welwitschiae, A. sclerotiicarbonarius, and A. brasiliensis.
As seen in Fig. 5a, the presence of unique SMGC families at every major branch point in the phylogenetic tree supports that SMGCs are a part of what sets the species apart: all biseriates share a previously undescribed polyketide synthase (PKS) and an NRPS-like protein, the A. carbonarius clade a terpene cyclase, and the remaining biseriates share another PKS and another NRPS-like protein. Furthermore, the A. niger complex and A. tubingensis clade each share unique PKS genes. Uniseriates share four unique previously undescribed SMGC families (Fig. 5a). Examinations of individual species reveal that every single section Nigri species has a unique combination of SMGCs (Fig. 3b)

NaTurE GENETics
species genomes (with the exception of A. tubingensis, A. niger, A. brasiliensis, and A. vadensis) encode one or more unique SMGCs. These patterns show the existence of high diversity of SMGCs between species and of a homogeneous set of SMGCs within isolates from the same species.
Correlating secondary metabolisms with SMGC families links gene to function. As a further application of the constructed SMGC families, we hypothesized that we can correlate SMGC families to classes of compounds. We performed extensive exometabolome analysis of 27 of the sequenced strains and identified 35 compound families ( Fig. 5c and Supplementary Table 12). The most abundant group was naphtha-γ -pyrones, of which aurasperone B 29 was identified in 14 of the isolates. We compared the presence patterns of SMGC families with the compound class (Fig. 5c) and combined it with a knowledge-based filtering of InterPro domains leaving one hit (Methods and Supplementary  Fig. 8d). The candidate SMGC family is a nine-gene cluster found in 18 genomes-including the 14 where we detected the compound-and it contains all activities needed to synthesize aurasperone. In support of this identification, an SMGC for a closely related compound, aurofusarin, has been experimentally verified in Fusarium graminearum 48 . The aurasperone cluster shares six genes (one of which is a duplication) with the aurofusarin cluster. This finding supports the assignment of this family of SMGCs to the production of aurasperone B and conceptually justifies this approach for efficient linking of clusters to compounds. We see this correlation approach as highly useful for future elucidation of fungal metabolites.

Discussion
We have sequenced the genomes of a whole section of filamentous fungi, and a diverse set of A. niger isolates, and found that the species are highly diverse in some traits, in particular secondary metabolism and to a lesser extent regulatory proteins, and homogeneous in others, such as glycolytic metabolism and CAZymes. The presented data furthermore provide an extensive compendium of 24 new genomes, which adds substantial information on fungal genetic diversity. We further combined genome analysis with metabolite profiling and heterologous gene expression to identify the genetic basis of several phenotypes within primary and secondary metabolism.
Of particular interest was the finding that the species-specific genes in all species share functions within gene/protein regulation and secondary metabolism, showing that unique sets of these functions exist for all species in the investigated set.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, statements of data availability and associated accession codes are available at https://doi.org/10.1038/ s41588-018-0246-1.

Methods
Fungal strains. Unless otherwise noted, the species examined were taken from the IBT Culture Collection of Fungi at DTU. Strains employed in this study are denoted in Supplementary Table 1. Table 1), spores were defrosted from storage at − 80 °C and inoculated onto solid CYA medium. Fresh spores were harvested after 7-10 d and suspended in a 0.1% Tween solution. Spores were stored in solution at 5 °C for up to 3 weeks. Biomass for all fungal strains was obtained from shake flasks containing 200 ml of complex medium, either CYA, MEAox, or CY20 depending on the strain (see Supplementary Table 1) cultivated for 5-10 d at 30 °C. Biomass was isolated by filtering through Miracloth (Millipore, 475855-1R), freeze dried, and stored at 80 °C. DNA isolation was performed using a modified version of the standard phenol extraction (ref. 52 and below) and checked for quality and concentration using a NanoDrop (BioNordika). RNA isolation was performed using the Qiagen RNeasy Plant Mini Kit according to the manufacturer's instructions.

Purification of DNA and RNA. For all sequences generated for this study (Supplementary
A sample of frozen biomass was subsequently used for RNA purification. First, hyphae were transferred to a 2 ml microtube together with a 5 mm steel bead (Qiagen), placed in liquid nitrogen, then lysed using the Qiagen TissueLyser LT at 45 Hz for 50 s. Then the Qiagen RNeasy Mini Plus Kit was used to isolate RNA. RLT Plus buffer (with 2-mercaptoethanol) was added to the samples, vortexed, and spun down. The lysate was then used in step 4 in the instructions provided by the manufacturer, and the protocol was followed from this step. For genomic DNA, a protocol based on Fulton et al. 53  For all genomic Illumina libraries, 100 ng of DNA was sheared to 270 bp fragments using the Covaris LE220 (Covaris) and size selected using SPRI beads (Beckman Coulter). The fragments were treated with end-repair and A-tailing and ligated to Illumina-compatible adapters (IDT) using the KAPA-Illumina library creation kit (KAPA Biosystems).
For transcriptomes, stranded complementary DNA libraries were generated using the Illumina TruSeq Stranded Total RNA LT Sample Prep Kit. Messenger RNA (mRNA) was purified from 1 µ g of total RNA using magnetic beads containing poly(T) oligos. mRNA was fragmented using divalent cations and high temperature. The fragmented RNA was reverse transcribed using random hexamers and SSII (Invitrogen) followed by second-strand synthesis. The fragmented complementary DNA was treated with end-pair, A-tailing, adapter ligation, and 10 cycles of PCR.
The prepared libraries were quantified using KAPA Biosystems' nextgeneration sequencing library quantitative PCR kit and run on a Roche LightCycler 480 real-time PCR instrument. The quantified libraries were then multiplexed with other libraries, and library pools were prepared for sequencing on the Illumina HiSeq sequencing platform using a TruSeq paired-end cluster kit, v3, and Illumina's cBot instrument to generate clustered flow cells for sequencing. Sequencing of the flow cells was performed on the Illumina HiSeq2000 sequencer using a TruSeq SBS sequencing kit, v3, following a 2 × 150 indexed run recipe.
After sequencing, the genomic FASTQ files were quality control-filtered to remove artifacts/process contamination and assembled using Velvet 54 . The resulting assemblies were used to create in silico long mate-pair libraries with inserts of 3,000 ± 90 bp, which were then assembled with the target FASTQ using AllPathsLG release version R47710 55 . Illumina transcriptome reads were assembled into consensus sequences using Rnnotator v3.3.2 56 .
For the genomes of A. heteromorphus, A. eucalypticola, and A. sclerotioniger, amplified libraries were generated using a modified shearing version of the Pacific Biosciences standard template preparation protocol. To generate each library, 5 μ g of genomic DNA was used. The DNA was sheared using a Covaris LE220 focusedultrasonicator with their Red miniTUBES to generate fragments 5 kb in length. The sheared DNA fragments were then prepared according to the Pacific Biosciences protocol using their SMRTbell template preparation kit, where the fragments were treated with DNA damage repair (ends were repaired so that they were blunt ended and 5′ phosphorylated). Pacific Biosciences hairpin adapters were then ligated to the fragments to create the SMRTbell template for sequencing. The SMRTbell templates were then purified using exonuclease treatments and size-selected using AMPure PB beads.
Sequencing primer was then annealed to the SMRTbell templates, and version P4 sequencing polymerase was bound to them. The prepared SMRTbell template libraries were sequenced on a Pacific Biosciences RS II sequencer using version C2 chemistry and 2 h sequencing movie run times. The three Pacific Biosciences genome datasets were assembled using HGAP3 (see URLs).
All genomes were annotated using the JGI annotation pipeline 33  See also the Nature Research Reporting Summary linked to this article.
Analysis of secondary metabolism. Cultivation for secondary metabolite analysis. Fungal strains were cultivated as three-point cultures on CYA, CYAS 29 , and YES media for 7 d in the dark at 25 °C. Three 6 mm inner diameter plugs taken across the cultures were then extracted using an (3:2:1) (ethylacetate-dichloromethanemethanol) mixture and dissolved in methanol 57 .
Chemical analysis of secondary metabolites. All chemical analyses were done by reversed-phase ultrahigh-performance liquid chromatography (UHPLC) coupled to ultraviolet-visible diode array detection (DAD) combined with either fluorescence detection (FLD) or high-resolution mass spectrometry (HRMS). Three different methods were used: Method 1. Pure UHPLC-DAD-FLD was performed using a rapid-separation liquid chromatography (RSLC) UltiMate 3000 system (Dionex) linked to an 1100 Series FLD (Agilent). The system was equipped with an Agilent Poroshell phenyl-hexyl column (150 × 2.1 mm, 2.6 μ m) and was run using a linear gradient of wateracetonitrile starting at 10% acetonitrile and increasing to 100% (both containing 50 ppm trifluoroacetic acid) over 8 min, then using 100% acetonitrile for 2 min. The column temperature was 60 °C, the flow rate 0.8 ml min −1 , and the injection volume was 1 µ l. The ultraviolet spectra 200-640 nm were matched against our internal database 29 . Full-scan data were analyzed as above in MassHunter 59 . MS/MS data were matched to our internal MS library (~1,700 compounds) of reference standards and tentatively identified compounds 59 .
Genome annotation and analysis. Genome annotation. All genomes were annotated based on the JGI annotation pipeline 34 as previously described 60 .
Whole-genome phylogeny. Protein sequences of all organisms were compared using BLASTp (e-value cut-off 1 × 10 −5 ). Orthologous groups of sequences were constructed on the basis of best bidirectional hits. Two hundred groups with a member from each species were selected, and the sequences of each organism were concatenated into one long protein sequence. Concatenated sequences were aligned using MAFFT (thread 16), and well-aligned regions were extracted using gBlocks (− t = p; − b4 = 5; − b5 = h). Trees were then constructed using multithreaded RAxML, the PROTGAMMAWAG model, and 100 bootstrap replicates.
Secondary metabolite-specific PFAM domains were taken from Supplementary Table 2 of the SMURF paper 61 .
As input, the program takes genomic coordinates and the annotated PFAM domains of the predicted genes. Based on the multidomain PFAM composition of identified 'backbone' genes, it can predict seven types of secondary metabolite clusters: (1) polyketide synthases (PKSs), (2) PKS-like, (3) non-ribosomal peptidesynthetases (NRPSs), (4) NRPS-like, (5) hybrid PKS-NRPS, (6) prenyltransferases (DMATS), and (7) terpene cyclases (TCs). Besides backbone genes, PFAM domains, which are enriched in experimentally identified secondary metabolite clusters (secondary metabolite-specific PFAMs), were used in determining the borders of gene clusters. The maximum allowed size of intergenic regions in a cluster was set to 3 kb, and each predicted cluster was allowed to have up to 6 genes without secondary metabolite-specific domains.
Prediction of secreted proteases. Secretome prediction was done using an in-house adaptation of SignalP 62 .
Gene-compound assignment. Identification of conserved or highly similar fungal gene clusters was performed on the basis of the gene cluster predictions above. The genomes were compared using the BLASTp function from the BLAST+ suite 63 . Presence/absence of an orthologous gene to a member in a gene cluster was based on a bidirectional best hit, with e < 1 × 10 −100 and coverage of > 90%. Presence/ absence of a full gene cluster was based on the occurrence of the majority of the predicted members in a gene cluster, including the backbone synthetase in another species.
Detection of encoded CAZymes. Each Aspergillus protein model was compared using BLASTp to proteins listed in the Carbohydrate-Active enZYmes database (CAZy) 64 . Models with over 50% identity over the entire length of an entry in CAZy were directly assigned to the same family (or subfamily when relevant). Proteins with less than 50% identity to a protein in CAZy were all manually inspected, and conserved features, such as the catalytic residues, were searched whenever known. Because 30% sequence identity results in widely different e-values (from non-significant to highly significant), for CAZy family assignments, we examined sequence conservation (percentage identity over CAZy domain length). Sequence alignments with isolated functional domains were performed in the case of multimodular CAZymes. The same methods were used for Penicillium chrysogenum and Neurospora crassa.
Mapping of genes shared by groups of species. All predicted sets of protein sequences for the 38 genomes analyzed were aligned using the BLASTp function from the BLAST+ suite version 2.2.27 ( e-value cut-off ≤ 1 × 10 −10 ). These 1.444 whole-genome BLAST tables were analyzed to identify bidirectional hits in all pairwise comparisons. Using custom Python scripts, homologs were identified within and across the genomes and grouped into sequence-similar families using single linkage, if they met the following criterion: The sum of the alignment coverage between the pairwise sequences was > 130%, the alignment identity between the pairwise sequences was > 50%, and the hit must be found in both of the species' BLAST output (reciprocal hits). Singletons were assigned a family having only one gene member. This allowed for identification of species-unique genes as well as genes shared by sections, clades, and sub-clades of species. All homologs were assigned functional and structural domains using InterPro version 48 65 and checked for annotation and sequencing errors by investigating scaffold location and sequence identity.
For the analysis of the pan-and core-genomes of a subset of 38 fungal species used in this study, the orthologous and paralogous families were subsetted to include only the species of interest. Therefore, the genes representing the core and unique portion of the genomes will adjust relative to the accompanying species.
Identification of SMGC families. Our implementation of SMURF was run on genomic data from 37 Aspergillus strains. Proteins of the resulting SMGCs were compared with each other by alignment using BLASTp (BLAST+ suite version 2.2.27, e-value ≤ 1 × 10 −10 ). Subsequently, a score based on BLASTp identity and shared proteins was created to determine the similarity between gene clusters as depicted in the formula below. Using these scores, we created a weighted network of SMGC clusters and used a random walk community detection algorithm (R version 3.3.2, igraph_1.0.1 66 ) to determine families of SMGC clusters. Finally, we ran another round of random walk clustering on the communities that contained more members than species in the analysis (ptailoring/pbackbone = sum of percentage BLAST alignment of tailoring/backbone enzymes, respectivly; ntailoring/ nbackbone = number of tailoring/backbone enzymes with significant hits, respectivly; ttailoring/tbackbone = total number of tailoring/backbone enzymes): To create a cluster similarity score, a combined score of tailoring and backbone enzymes was created. The sum of the BLASTp percent identity (ptailoring/ pbackbone) of all hits for tailoring enzymes between two clusters was divided by the maximum amount of tailoring enzyme (ttailoring/tbackbone) and multiplied by 0.35. Then the score for the backbone enzymes was calculated in the same manner but multiplied by 0.65 to give more weight to the backbone enzymes. The scores were added to create an overall cluster similarity score: × . + × . avg(pident ) 0 35 avg(pident ) 0 65 tailoring b ackbones

Identification of shared SMGC families at nodes of the phylogenetic tree.
A list containing organisms of each branch of the phylogenetic tree was created and compared with the list of organisms for each SMGC family. If all organisms of a family matched, the count on the corresponding node was increased by one.
Prediction of the aurasperone B gene cluster. Lists of organisms for all SMGC families were compared with the lists of aurasperone B-producing species and filtered for InterPro annotations containing the terms 'cytochrome P450' or 'methyltransferase' .
Primary metabolism. Copy numbers were assessed using the homologous protein families generated during the analysis of genome diversity. The gene pathway associations were taken from the A. niger genome-scale model 41 . All proteins in the respective protein families were considered putative isozymes and were included in the copy number analyses.
Comparing the putative isoenzymes in the different species, gene sequences were aligned and clustered using neighbor-joining with MUSCLE v3.8.31 68 . Resulting trees were visualized and edited for publication using the Python ETE Toolkit 67 . Subcellular localizations for the genes included in the analysis were predicted using the TargetP 69 web server (see URLs).
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Code availability. Code for generation of gene families of homologs and for generation of SMGC families is available from GitHub (see URLs).

Data availability
All genomes used in the study are available from Joint Genome Institute fungal genome portal MycoCosm (http://jgi.doe.gov/fungi). All new genomes published in the study have been deposited in the DNA Data Bank of Japan