Identifying the genomic changes that control morphological variation and understanding how they generate diversity is a major goal of evolutionary biology. In Heliconius butterflies, a small number of genes control the development of diverse wing colour patterns. Here, we used full-genome sequencing of individuals across the Heliconius erato radiation and closely related species to characterize genomic variation associated with wing pattern diversity. We show that variation around colour pattern genes is highly modular, with narrow genomic intervals associated with specific differences in colour and pattern. This modular architecture explains the diversity of colour patterns and provides a flexible mechanism for rapid morphological diversification.
Recent adaptive radiations, such as the Heliconius butterflies 1 , Galápagos finches 2 and African cichlids 3 , offer insight into evolutionary and ecological forces that underlie diversification. Typically, ecological opportunities allow natural and sexual selection to drive adaptive change and speciation. At a genetic level, recruitment from ancient polymorphism, introgression of adaptive variants between populations and de novo mutation are important sources of variation. However, the genetic architecture of the traits under natural and sexual selection that facilitates rapid diversification remains largely unexplored.
In this study, we sequenced the genome of the Neotropical butterfly Heliconius erato and used re-sequence data from 116 additional individuals to dissect the architecture of genomic variation associated with their vividly coloured wing patterns. With over 400 different wing colour forms among 46 described species 4 , Heliconius represents one of the most visually diverse radiations in the animal kingdom and an excellent system for establishing a broad and integrative view of morphological diversification. The evolution of scale cells and the spatial coordinate system that controls wing pigmentation is a key innovation of the Lepidoptera. Wing patterns are often under strong natural and sexual selection, and these forces probably shape much of the pattern diversity we see among the more than 160,000 butterfly and moth species 5 .
In Heliconius, conspicuous wing patterns are important for signalling toxicity to potential predators and play a role in mate selection. Natural selection favors Müllerian mimicry among toxic butterflies, resulting in convergence between co-occurring species, as well as geographic divergence between populations of the same species. Among Heliconius butterflies, the genetic basis of this wing diversity has been studied for nearly 60 years and more than 30 Mendelian loci have been described. Over the past decade, however, genetic research has shown that most of the complexity of colour variation across Heliconius is actually controlled by relatively few genes acting broadly across the fore- and hindwing.
Here, we sequenced the genomes of 15 distinctly coloured H. erato races and 8 closely related species to fully describe the regulatory architecture driving adaptive evolution of the major genes acting in Heliconius wing patterning (Fig. 1). Our genomic survey included samples obtained near seven transition zones of hybridizing H. erato races with divergent wing patterns (Fig. 2a). In these hybrid zones, the high rate of genetic admixture allows for detailed genotype by phenotype (G × P) association mapping to identify discrete genomic intervals associated with colour and pattern variation on Heliconius wings 21,22 . We then further investigated these intervals with a novel phylogenetic method for identifying conserved non-coding regions in closely related non-hybridizing races and species. This combined strategy of association mapping and phylogenetic inference resulted in a distinct set of narrow genomic intervals that corresponded to loci described in early crossing experiments 9 (Supplementary Table 1). All the intervals fell within non-coding regions adjacent to colour pattern genes that affect forewing band shape (wntA; Fig. 3), red pigmentation (optix; Fig. 4) and a yellow hindwing bar (cortex; Fig. 5). Our results underscore a highly modular regulatory architecture that provides a flexible mechanism for rapid morphological change (Fig. 6).
Results and discussion
Reference sequence and variants
With more than 25 different wing pattern races, H. erato provides exceptional opportunities to explore the links between genotype, phenotype, form and function. We first constructed a high-quality reference genome by combining hybrid assembly with high-resolution linkage analysis. Our assembly and validation strategy generated one of the most contiguous and accurate Lepidopteran genomes assembled thus far (Supplementary Section 2), which is available on the LepBase genome browser. The final assembly consisted of 198 scaffolds with N50 length of over 10 Mb and a total assembly length of 383 Mb. A total of 13,678 genes were identified using RNA-seq and a thorough annotation process (Supplementary Section 3). To examine variation across our reference genome, we generated high-coverage (15–30×) whole-genome resequence data from 116 individuals of H. erato and closely related species. For the 101 H. erato individuals sampled, we genotyped the majority of the non-repetitive portion of the genome (average of 62% per individual; Supplementary Section 4.1). For the 15 individuals from the 8 outgroup species, the number of genotyped positions was lower, but above 40% for the most divergent comparison (Supplementary Section 4.1).
Genome-wide divergence across the H. erato colour pattern radiation.
Within H. erato, individuals clustered by geographic proximity rather than colour pattern phenotype, as has been previously reported 23 (Fig. 1b,c). For example, forewing red banded H. erato races were found in all three (Caribbean/Pacific Coast, East Amazonian, and West Amazonian) major geographic lineages (Fig. 1). Even within these broad geographic regions, individuals used in this study grouped together by sampling location rather than wing morphology. Indeed, there was little genetic differentiation between H. erato individuals sampled across major phenotypic transition zones, except around the genomic regions already known to be involved in colour pattern variation (Fig. 2a). Genetic divergence as measured by FST (see Methods) was close to zero across most of the genome, supporting the hypothesis of unhindered gene flow except at the regions responsible for colour pattern differences (FST < 0.1 in 97.07 ± 0.03% of 50 kb windows; Supplementary Section 3.3) 22 . This contrasted with three sharp peaks of genomic differentiation across known colour pattern loci on chromosome 10 near the wntA gene, on chromosome 15 near cortex, and on chromosome 18 near optix (shown in red in Fig. 2b). As previously reported for the region around optix 22 , these regions showed the expected signatures of selection, including reduced nucleotide diversity and elevated dXY relative to genome-wide averages (Supplementary Section 4.3).
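To make the windowed FST scan concrete, the following is a minimal sketch of how differentiation could be summarized in 50 kb windows. The estimator used in the study is described in its Methods, which are not reproduced here, so this sketch assumes Hudson's estimator (a ratio of averages over SNPs); the function names, the input layout and the toy allele frequencies are illustrative only.

```python
def hudson_fst(sites):
    """Hudson's FST as a ratio of averages over SNPs.

    sites: iterable of (p1, n1, p2, n2), where p is the allele frequency in
    each population and n is the number of sampled chromosomes.
    """
    num = den = 0.0
    for p1, n1, p2, n2 in sites:
        num += (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
        den += p1 * (1 - p2) + p2 * (1 - p1)
    return num / den if den > 0 else float("nan")


def windowed_fst(snps, window=50_000):
    """Group SNPs into non-overlapping windows and estimate FST in each.

    snps: list of (position, p1, n1, p2, n2). Returns {window_start: fst}.
    """
    bins = {}
    for pos, p1, n1, p2, n2 in snps:
        start = (pos // window) * window
        bins.setdefault(start, []).append((p1, n1, p2, n2))
    return {start: hudson_fst(s) for start, s in sorted(bins.items())}


# Toy data (hypothetical): two undifferentiated SNPs in the first window,
# one fixed difference in the second.
toy_snps = [
    (1_000, 0.5, 10, 0.5, 10),
    (2_000, 0.4, 10, 0.4, 10),
    (60_000, 1.0, 10, 0.0, 10),
]
fst_by_window = windowed_fst(toy_snps)
print(fst_by_window)
```

With real data, windows whose FST stands far above the near-zero genomic background would correspond to the peaks described above.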
Associating genomic variation with colour pattern diversity.
Genetic differences at the regions controlling phenotypic variation in Heliconius are maintained by strong natural selection.
Modular architecture of forewing black colour variation.
Recent genetic mapping, coupled with studies of gene expression, suggests that a single gene, wntA, is driving much of the forewing pattern variation across Heliconius species 27 . Indeed, our G × P association highlighted a 100 kb non-coding region near wntA on chromosome 10 (Fig. 3). Clusters of fixed SNPs defined discrete genomic intervals associated with the phenotypic effects of the Sd, St and Ly loci that were first described more than 30 years ago 9 . Variation at Sd, St and Ly was predicted to control patterning across the middle to the most distal sections of the forewing, respectively (Fig. 3a). Consistent with this hypothesis, we identified: (1) a 25 kb region of fixed differences between H. e. notabilis and H. e. lativitta that differed across the lower (Sd) and the middle (St) region of the forewing (shown in purple in Fig. 3b); (2) a narrow peak of association between H. e. notabilis and H. e. etylus that differed only in the lower forewing region (Sd) (blue in Fig. 3b); and (3) a broad region of association that spans roughly 60 kb and appears to be composed of several distinct peaks between H. e. erato and H. e. hydara from French Guiana that differed in St and Ly (orange in Fig. 3b). Comparisons between races with identical forewings showed no G × P association across any of these regions (green in Fig. 3b).
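The notion of a fixed SNP can be illustrated with a short sketch. It assumes genotypes are coded as alternate-allele counts (0, 1 or 2) per individual, with None for missing calls; this encoding, the function name and the toy genotypes are hypothetical and are not taken from the study's pipeline.

```python
def fixed_differences(geno_a, geno_b):
    """Return indices of sites where two groups are fixed for different alleles.

    geno_a, geno_b: lists of per-individual genotype lists. Each genotype is
    the alternate-allele count (0, 1, 2) or None when missing.
    """
    n_sites = len(geno_a[0])
    fixed = []
    for site in range(n_sites):
        calls_a = {ind[site] for ind in geno_a if ind[site] is not None}
        calls_b = {ind[site] for ind in geno_b if ind[site] is not None}
        # Fixed difference: one group is entirely 0/0, the other entirely 1/1.
        if (calls_a, calls_b) in (({0}, {2}), ({2}, {0})):
            fixed.append(site)
    return fixed


# Toy genotypes (hypothetical): three individuals per race, four sites.
race_a = [[0, 0, 2, 1], [0, 1, 2, 0], [0, 0, 2, None]]
race_b = [[2, 0, 0, 1], [2, 0, 0, 2], [2, 1, 0, 0]]
print(fixed_differences(race_a, race_b))
```

Clusters of such sites along a scaffold are what delimit the association intervals; a site with any shared or heterozygous call in either race is not counted.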
To further refine the regions associated with forewing band pattern, we used a novel tree weighting approach called Twisst (topology weighting by iterative sampling of subtrees; see Methods) 28 to explore how phylogenetic relationships varied around wntA. We hypothesize that the genomic variation underlying wing pattern differences should cluster individuals by wing pattern rather than geographic proximity. Sliding window phylogenetic comparisons identified four narrow genomic intervals near wntA that were strongly associated with changes in the spatial distribution of black scales on the forewing (Fig. 3c). The first region was a 10 kb interval roughly 50 kb upstream of wntA (blue in Fig. 3c) that supported the monophyletic grouping of races that are partially black in the lower midsection of the forewing extending just distal of the discal cell region. Similarly, a separate 8 kb interval roughly 35 kb upstream of wntA grouped geographically distant individuals with similar distribution of black scales across most of the distal mid-section of the forewing (St interval) (green in Fig. 3c). Finally, two additional regions, one 25 kb upstream of wntA and another centered on wntA, grouped all individuals that were partially black in the upper section of the forewing (Ly intervals) (orange in Fig. 3c). Although the region centered on wntA showed some support for tree topologies based on geographic proximity, we still considered it a possible colour pattern interval because the phenotypic grouping is more strongly supported than geographic grouping. Other areas across this region supporting the phenotypic tree also showed similar support for tree topologies based on geographic proximity and were not considered as candidate colour pattern intervals.
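One way to picture how such intervals might be delimited is to scan per-window topology weights and merge adjacent windows in which the phenotype-grouping topology dominates. The sketch below assumes the weights have already been computed and summarized into two values per window; it does not call or reproduce Twisst itself, and the threshold, function name and toy weights are hypothetical.

```python
def candidate_intervals(windows, min_weight=0.5):
    """Merge adjacent windows where the phenotype topology dominates.

    windows: list of (start, end, w_phenotype, w_geography), sorted by start,
    where each weight is the fraction of sampled subtrees matching that
    grouping. Returns a list of merged (start, end) intervals.
    """
    intervals = []
    for start, end, w_pheno, w_geo in windows:
        if w_pheno >= min_weight and w_pheno > w_geo:
            if intervals and intervals[-1][1] == start:
                # Extend the previous interval when windows are contiguous.
                intervals[-1] = (intervals[-1][0], end)
            else:
                intervals.append((start, end))
    return intervals


# Toy weights (hypothetical).
toy_windows = [
    (0, 1000, 0.10, 0.80),
    (1000, 2000, 0.90, 0.05),
    (2000, 3000, 0.70, 0.20),
    (3000, 4000, 0.45, 0.50),
    (4000, 5000, 0.60, 0.30),
]
print(candidate_intervals(toy_windows))
```

Requiring the phenotype weight to exceed the geography weight mirrors the reasoning in the text, where regions with comparable support for geographic grouping were not treated as candidates.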
Our genomic analysis also confirmed a new locus (Ro) responsible for pattern variation in the most distal region of the forewing band. Comparisons of H. e. notabilis and H. e. lativitta showed an approximately 71 kb region associated with pattern differences in the upper forewing (purple in Fig. 3b). Similar to the wntA region, G × P associations were localized to non-genic regions near two genes, the Heliconius homologue of the ventral veins lacking gene (vvl) and the homologue of radial spoke head protein 3 (rsp3). The transcription factor vvl is involved in the formation of specific wing veins, neuronal differentiation and steroid production in Drosophila melanogaster.
Modular architecture of red pattern variation.
Regulation of red patterns across the fore- and hindwing of H. erato, known to be under control of the gene optix 14,17 , was also highly modular. We identified discrete genomic intervals near optix that were associated with the presence of red hindwing rays, a red patch (‘dennis’) in the proximal part of the forewing and a red forewing band. We use the original nomenclature in H. erato for these different pattern elements: R for red hindwing ‘rays’, D for a red dennis forewing patch and Y for forewing ‘band’ colour (Fig. 4a) 9 .
Associations between individuals that differed across all three pattern elements, the so-called ‘dennis-rayed’ and ‘postman’ phenotypes, were strongly clustered in a 69 kb region downstream of optix (Fig. 4b) 26 . Within this 69 kb region, G × P associations between hybridizing H. e. amalfreda and H. e. erato, which differ only by the absence/presence of hindwing rays, were clustered in a 7 kb interval (Fig. 4b). In this interval, H. e. amalfreda possessed the postman haplotype, which contrasts with the rest of the 69 kb region where H. e. amalfreda shared a haplotype with H. e. erato. Phylogenetic trees constructed from this region grouped H. e. amalfreda with postman phenotypes that lack rays (red shading in Fig. 4c). Unexpectedly, the tree across this interval clustered the outgroup species (H. telesiphe, H. hortense, H. hecalesia, H. clysonymus and H. sara) on a derived node with all rayed H. erato races (Supplementary Section 5.3.2). Heliconius hecalesia, H. hortense and H. clysonymus all have large red hindwing patches, whereas H. sara and H. telesiphe possess much smaller red spots on the underside of their hindwing. This pattern contrasts with the phylogenetic placement of these species in the tree constructed with data from the rest of the genome (Fig. 1a), possibly reflecting historical introgression of modular elements among species closely related to H. erato. Such patterns of introgression have also been observed in other closely related Heliconius species 1,37 .
Genomic intervals strongly associated with forewing band colour (Y) and the red dennis patch (D) were similarly localized using the combination of G × P association and phylogenetic weighting. For forewing band colour, we identified two distinct and narrow intervals separated by approximately 20 kb (yellow in Fig. 4b,c). In these regions, there were 15 fixed SNPs that distinguished butterflies with a red forewing band from those that lacked red. Phylogenetic trees from this region strongly supported clustering of the red-banded phenotypes H. telesiphe, H. hermathena, H. e. favorinus and H. e. hydara, whereas H. himera, H. hortense, H. clysonymus and H. hecalesia, all of which lack red on the forewing, grouped with the yellow-banded H. erato races (Fig. 4c and Supplementary Section 5.3.2). Finally, we identified several intervals associated with the red dennis patch. For this analysis, we focused primarily on genetic variation within H. himera. Heliconius himera has red on the hindwing similar to rays, but lacks the dennis patch. Therefore, comparing H. himera and H. erato races with a dennis/rays phenotype allowed us to separate the dennis from the ray elements. Across the 69 kb region, there was a 12 kb area where H. himera genotypes were similar to the postman haplotype (grey in Fig. 4b). Phylogenetic weighting analysis in this area strongly supported the grouping of H. himera individuals by colour pattern phenotype with postman races from both sides of the Amazon basin (grey in Fig. 4c).
Independent modules generate convergent yellow hindwing bar phenotypes.
Recent association and expression data implicated the gene cortex as important in controlling a variety of pattern elements across the Heliconius wing, including presence or absence of a yellow hindwing bar in H. erato, known as the Cr locus 9,16 . In H. erato, we identified two discrete regions containing clusters of fixed sites associated with a yellow hindwing bar in two geographically isolated, yet phenotypically similar, H. erato races (Fig. 5). The Peruvian races H. e. favorinus and H. e. emma differed across an interval consisting of 269 fixed SNPs over 100 kb roughly centered on cortex (red in Fig. 5). Eight of these SNPs fell within the coding region of cortex, but only one resulted in an amino acid substitution (an arginine to lysine at scaffold Herato1505 position 2,087,610). Curiously, a different region distinguished the Panamanian races H. e. demophoon and H. e. hydara (green in Fig. 5), which show a similar difference in the presence/absence of a yellow hindwing bar. In this hybrid zone, there was a cluster of fixed differences located roughly 100 kb away and centered on the Heliconius homologue of parn, a poly(A)-specific ribonuclease. These association differences are consistent with the independent evolution of the yellow hindwing bar on either side of the Andes 34,38 .
In H. erato, there are other colour pattern elements controlled by variation at this locus, including the presence/absence of white hindwing fringes and a yellow forewing line 39 , but our sampling of H. erato races did not allow us to distinguish these elements (Supplementary Section 5.4). The hybrid zone comparisons H. e. notabilis/H. e. lativitta and H. e. notabilis/H. e. etylus also showed increased FST estimates near the cortex gene, but no pattern of perfect association was observed for these comparisons. Crossing experiments have suggested possible epistatic interactions between cortex and wntA 38,40 , which provides a possible explanation for this increased divergence without any phenotypic effect known to be directly controlled by the cortex locus. Furthermore, the phenotypic effects of alleles at this locus can be dramatic in other Heliconius species 16 , suggesting that this locus interacts broadly with the other Heliconius patterning loci 10,41 .
Modular regulatory architecture and pattern diversity within H. erato.
Less than 0.2% of the genome was associated with wing pattern diversity across the H. erato radiation. This variation was highly modular and fell in non-coding regions near colour patterning genes, including optix, wntA and cortex.
Both shuffling of existing modules and de novo evolution of new modules are associated with phenotypic diversity in H. erato. Indeed, we can recreate the colour pattern diversity across the H. erato radiation using a combination of non-genic regions near four colour pattern genes (Fig. 6). This conclusion is perhaps best exemplified in the distribution of genetic variation around wntA, where different colour pattern races have different combinations of four distinct genomic intervals. These different intervals are likely to regulate the expression of wntA in different areas of the forewing to adjust the position, size and shape of the forewing band to closely match patterns in other co-occurring warningly coloured butterfly species. Within this modular framework, recombination can reshuffle existing regulatory variation to generate new combinations of regulatory elements and new wing pattern phenotypes. Recombination of colour pattern modules and introgression into other populations is likely to be driven by high rates of gene flow between adjacent populations. For example, H. e. amalfreda appears to have evolved via recombination of regulatory variation between rayed (H. e. erato) and red-banded (H. e. hydara) haplotypes that instantaneously generated a novel wing pattern, a process that closely mirrors the one recently described in the co-mimetic forms of H. melpomene 37 .
New regulatory modules associated with wing pattern variation can also evolve de novo, further increasing the flexibility of these regions to generate pattern diversity. This was evident in the independent evolution of the yellow hindwing bar in the H. erato clade (Fig. 5), and also in the comparison of regulatory variation around the red patterning locus between H. erato and its co-mimic H. melpomene. Red pattern variation in the two species is similarly generated by regulatory differences at the optix locus 14 , and the genomic position and order of its cis-regulatory elements are broadly similar 26 . Furthermore, in both species distinct intervals were associated with different red pattern elements, and ‘enhancer shuffling’ through recombination has similarly generated novel red pattern phenotypes 37 . This implies considerable conservation of function of optix cis-regulatory regions that were re-used to generate the convergent patterns that underlie mimicry. Nonetheless, the precise elements associated with placement of red in discrete areas of the fore- and hindwing are not homologous in the two species (Supplementary Section 5.3.3). Thus, convergent patterns are clearly independently derived in the two radiations by the parallel evolution of new enhancer variation.
Our results reconcile decades of genetic and genomic studies of Heliconius colour pattern variation 9,42 . For the first time, we were able to place an entire radiation within a single genomic framework. This work has reinforced the role of a simple toolkit of a few colour pattern genes and demonstrated that pattern diversity is likely to be generated by the regulatory complexity around these genes. We have characterized a discrete number of 1–7 kb intervals that modulate phenotypic variation, and show that divergent and convergent morphologies are the product of enhancer shuffling and de novo independent evolution of these modules. Overall, our work provides a genomic framework to further explore this regulatory complexity. The regions we identified may contain a number of distinct regulatory elements that may be further resolved with chromatin accessibility data 43 and studied in detail with targeted genome editing. Such an integrated genomic view promises to accelerate our understanding of the links between genotype and phenotype, and how they play out on a developing butterfly wing. This research has broader ramifications because the small number of genes shown to generate wing pattern variation across Heliconius have been implicated in pattern variation in other butterflies and moths 16,19,44 . Thus, the Heliconius wing pattern loci appear to be ‘genomic hotspots’ that underlie the evolution of phenotypic diversity in Lepidoptera. The radiation of warning colours in H. erato provides an example of regulatory complexity generated by a small toolkit of genes. This may well be a common hallmark of rapid morphological diversification in adaptive radiations.
Scaffold assembly and validation.
The H. erato (race demophoon) genome was assembled using Illumina paired-end reads with different insert sizes and partially gap filled with PacBio data (Supplementary Table 2). Illumina data were produced according to the ALLPATHS-LG assembly protocol 45 with the paired-end library originating from a single individual and the mate-pair libraries from a second, sibling, individual. An initial assembly was performed with ALLPATHS-LG using default parameters and the reads were mapped back to the assembly to acquire accurate distributions of fragment size for each library. Next, contaminant small fragment sequences were purged from the paired-end and mate-pair libraries. Reads were error-corrected using the software Blue 46 . A kmer database was built from the raw paired-end data and used to remove unsupported reads from mate-pair libraries. This step reduced polymorphism that may cause erroneous assembly. The PacBio data were error-corrected using the Illumina data and the LoRDEC software 47 .
Five assemblies were obtained using different combinations of raw or error-corrected Illumina data. Each assembly was quality checked against approximately 4 Mb of BAC sequences using nucmer 48 . All assemblies gave similar amounts of gapped sequence (about 10% of the base pairs), which reflects long simple repeats scattered across the genome. The assembly with the best statistics (that is, highest N50s and best alignment to BAC) was then post-processed to replace putative tandem repeats with Ns. Small repetitive scaffolds and putative redundant haplotype sequences were removed based on a combination of ‘all-versus-all’ alignments and depth of coverage estimates prior to performing ALLPATHS-LG scaffolding. Gaps were then filled with PBJelly 49 using the filled fragment pairs, the corrected PacBio data and the small scaffolds that had been previously removed. PBJelly was run three times iteratively to balance sensitivity and specificity, and the final assembly, called Hera_Stage1, had a length of 402.8 Mb and a scaffold N50 of 612 kb. The assembly process and associated statistics are provided in Supplementary Table 2 and Supplementary Fig. 1.
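For readers unfamiliar with the N50 statistic used to compare assemblies, a minimal sketch of its standard definition follows: the length of the scaffold at which the cumulative length, counted from the longest scaffold down, first reaches half of the total assembly length. The function name and toy lengths are illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first scaffold that brings the running sum to >= 50%.
        if running * 2 >= total:
            return length
    return 0  # empty input


# Toy scaffold lengths (hypothetical): total 20, so N50 is reached at 6 + 5.
print(n50([2, 3, 4, 5, 6]))
```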
We generated a high-resolution linkage map by sequencing a backcross family generated from our focal genomic line (Supplementary Fig. 2). Our strategy was to identify markers by coupling high-coverage, whole-genome sequencing (30–40×) of each parent with low coverage (5–10×) sequencing of their offspring. The low sequencing coverage of the offspring makes it difficult to determine individual genotypes with high accuracy. We therefore developed an in-house pipeline utilizing the mpileup command in SAMtools 50 to produce genotype posteriors over a candidate set of 6.7 million SNPs. These genotype posteriors were used to construct a linkage map with Lep-Map3 (https://sourceforge.net/projects/lep-map3/), a new linkage mapping software developed from the Lep-Map1/2 software 51,52 .
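The difficulty of calling genotypes at low coverage can be illustrated with a simplified model. The sketch below is not the in-house pipeline described above; it assumes a biallelic SNP, independent reads, a single per-base error rate and a user-supplied genotype prior, all of which are illustrative assumptions.

```python
def genotype_posteriors(ref_reads, alt_reads, prior=(1 / 3, 1 / 3, 1 / 3), error=0.01):
    """Posterior probabilities of (hom-ref, het, hom-alt) from read counts.

    Uses a simple binomial read model: a homozygote shows the other allele
    only through sequencing error, a heterozygote shows each allele with
    probability 0.5.
    """
    likelihoods = (
        (1 - error) ** ref_reads * error ** alt_reads,   # homozygous reference
        0.5 ** (ref_reads + alt_reads),                   # heterozygous
        error ** ref_reads * (1 - error) ** alt_reads,   # homozygous alternate
    )
    joint = [lik * p for lik, p in zip(likelihoods, prior)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical backcross prior: offspring are hom-ref or het with equal
# probability. Three reference reads and no alternate reads are observed.
posterior = genotype_posteriors(3, 0, prior=(0.5, 0.5, 0.0))
print([round(p, 3) for p in posterior])
```

In this toy case roughly one in nine such offspring would still be heterozygous despite showing only reference reads, which is why carrying posteriors forward is preferable to hard genotype calls at 5–10× coverage.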
The linkage map was constructed with Lep-Map3 as follows (see Supplementary Figure 3 in SI section 2.3): First, to obtain the most accurate parent genotypes, we calculated the parental genotype posteriors using the combined information from parents and offspring using the ParentCall module (Lep-Map2). Next, we calculated pair-wise LOD scores between markers with zero recombination rate (θ = 0) using the module SeparateIdenticals (Lep-Map3) with lodLimit = 26.5, informativeMask = 12 and numParts = 20. This step identified markers that segregated identically. The 20 most abundant identical maternal markers were used as the chromosome prints (each maternal marker in a chromosome segregates identically as there is no recombination in the female in Heliconius butterflies). In this step, we could identify 20 of the 21 chromosomes, because we found that chromosome 2 was completely homozygous in the mother. To identify chromosomes, especially chromosome 2, in the paternal linkage map, identical paternal markers were joined using module JoinLGs (Lep-Map3) with recombination rate θ = 0.01 and LOD score limit lodLimit = 20. More precisely, the linkage groups could be linked together for chromosome 2 by inspecting the markers at nearby positions in the assembly. These paternal markers clustered to 21 linkage groups identifying chromosome 2 and the same 20 chromosomes that were found in the maternal map. Next, the module ShortPath (Lep-Map3) was run on the identical paternal markers. This module finds the longest shortest path in a marker graph (i.e. the longest path in a graph for which the shortest path is chosen between pairs of markers), where markers are nodes and each marker pair has been connected with an edge of length 4n –3, if there are n detected recombinations (different genotypes considering both phases in this case) between the markers. The best paths were manually checked to determine the final order of the markers. 
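The "longest shortest path" idea can be sketched on a toy marker graph. This is not the ShortPath module itself: it is a generic all-pairs Dijkstra search, and it reads the edge length given in the text as 4n − 3 for n ≥ 1 recombinations, which is an assumption about the intended formula. Marker names and recombination counts are hypothetical.

```python
import heapq


def edge_length(n_recomb):
    # Assumption: the edge length in the text is read as 4*n - 3 for n >= 1.
    return 4 * n_recomb - 3


def shortest_paths(graph, source):
    """Dijkstra's algorithm; graph maps node -> {neighbour: length}."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, length in graph[node].items():
            nd = d + length
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    return dist, prev


def longest_shortest_path(graph):
    """Return (length, path) for the pair of nodes whose shortest path is longest."""
    best = (-1, None, None, None)
    for source in graph:
        dist, prev = shortest_paths(graph, source)
        target = max(dist, key=dist.get)
        if dist[target] > best[0]:
            best = (dist[target], source, target, prev)
    length, source, node, prev = best
    path = [node]
    while node != source:
        node = prev[node]
        path.append(node)
    return length, path[::-1]


# Toy data (hypothetical): recombinations observed between marker pairs.
recombinations = {
    ("A", "B"): 1, ("B", "C"): 1, ("C", "D"): 1,
    ("A", "C"): 2, ("B", "D"): 2, ("A", "D"): 3,
}
graph = {}
for (u, v), n in recombinations.items():
    graph.setdefault(u, {})[v] = edge_length(n)
    graph.setdefault(v, {})[u] = edge_length(n)

print(longest_shortest_path(graph))
```

Under this reading, an edge spanning several recombinations is longer than the chain of single-recombination steps it skips, so the shortest path between the two most distant markers threads through the intermediate markers in order.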
After the maternal and paternal markers were placed within a linkage framework (Supplementary Table 3), we added the remaining markers into this framework using JoinIdenticals (Lep-Map3), with LOD score limits of 25 and 20, for paternal and maternal markers, respectively. The 1.2 million markers that were heterozygous in both parents were discarded (informativeMask = 12). Finally, the identified linkage groups (chromosomes) were named to reflect the nomenclature of the H. melpomene genome. We were able to easily identify homologous chromosomes by mapping the flanking regions of each marker to the H. melpomene genome 1 . Our final linkage map covered all 21 chromosomes, including the Z chromosome.
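As a rough guide to what these LOD limits mean, the textbook two-point LOD score for a phase-known cross with k recombinants among n informative offspring is given below. This idealized form ignores genotype uncertainty, whereas Lep-Map3 works from genotype posteriors, so it should be read as an approximation and not as the software's exact computation; n and k are symbols introduced here for illustration.

```latex
\mathrm{LOD}(\theta) = \log_{10} \frac{\theta^{k}\,(1-\theta)^{\,n-k}}{(1/2)^{n}},
\qquad
\mathrm{LOD}(\theta = 0)\Big|_{k=0} = n \log_{10} 2 \approx 0.301\,n
```

Under this simplification, two markers that never recombine would reach a LOD of 20 once about 67 informative offspring are scored, and a LOD of 25 at about 84.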
Assembly correction and chromosomal scaffolding.
We used our high-resolution linkage map to error correct and improve our genome assembly. To do this, we first manually identified scaffolds that were inconsistent with our linkage map. About 10% of the scaffolds, representing 62 Mb, had such errors. Due to the high density of markers on our linkage map, most errors were localized within a few kb. These errors generally fell at a gap sequence, meaning that the scaffolding step of the assembly process, rather than the creation of contigs, caused most misassemblies. The scaffolds in the assembly with errors were cut to produce an error-free assembly. The assembly was also separated into chromosomes at this point. There was about 16 Mb of gapped sequence in the assembly. The 34 scaffolds that failed to map to chromosomes totaled 3.7 Mb, of which 3.5 Mb was bacterial genome sequence; the rest consisted mainly of very highly repetitive haplotypes that failed to create substantially long (>3 kb) contigs.
We produced the final assembly by integrating information from two independent de novo assemblies to gap fill our oriented stage2 assembly. The first was an ALLPATHS-LG assembly generated from the same Illumina paired-end and mate-pair dataset, and assembled as follows. Illumina paired-end and mate-pair data were subsampled to prescribed coverage depth according to ref. 45 and assembled using ALLPATHS-LG with “HAPLOIDIFY = TRUE” and “CLOSE_UNIPATH_GAPS = False”. The resulting assembly was improved by performing 3 iterations of PBJelly 49 , incorporating prior PBJelly assemblies into subsequent iterations. The second was an assembly of an additional sibling female individual using approximately 100× coverage of 2 × 250 Illumina data generated from PCR-free libraries. The genome of this individual was assembled using DISCOVAR de novo 53,54 . The scaffolds that spanned gaps in our assembly were extracted from the BWA-MEM 55 produced BAM files using in-house software. This software used a variant of Smith-Waterman local alignment 56 to compute the best alignment to fix gaps. Both positive and negative gaps were considered. The alignment parameters used were +1 for nucleotide match, −4 for mismatch, −8 for gap open and −1 for gap extension. Gaps were filled iteratively, using the independent ALLPATHS assembly first. Here we required an alignment score of 100 across a 4 kb region on each side of a gap for the gap to be filled. Regions with multiple gaps were joined as if they contained a single large gap. Finally, we filled remaining gaps using the DISCOVAR assembly. In this case, we used alignment to 2 kb regions around each gap. Using this strategy, we reduced the amount of gapped sequence in our assembly to 5.2 Mb. Assembly completeness, as assessed against a benchmarked set of 2,675 single-copy orthologues using BUSCO 57 , was 82% (2,179) in the H. erato genome and a further 11% were present, but marked as ‘fragmented’.
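The scoring scheme above can be illustrated with a generic local alignment sketch using affine gap penalties (the Gotoh formulation of Smith-Waterman). The in-house software used a variant that is not described in detail, so this is only a stand-in: it assumes a gap of length k scores −8 for its first base and −1 for each additional base, and it returns the score only, not the alignment. The toy sequences are hypothetical.

```python
def local_alignment_score(a, b, match=1, mismatch=-4, gap_open=-8, gap_extend=-1):
    """Best local alignment score between a and b with affine gap penalties."""
    neg = float("-inf")
    cols = len(b) + 1
    h_prev = [0] * cols      # best local score ending at cell (i-1, j)
    f_prev = [neg] * cols    # best score ending with a gap in b (vertical move)
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * cols
        f_cur = [neg] * cols
        e = neg              # best score ending with a gap in a (horizontal move)
        for j in range(1, cols):
            e = max(h_cur[j - 1] + gap_open, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_extend)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h_cur[j] = max(0, h_prev[j - 1] + sub, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


# Toy sequences (hypothetical): b is a with a single base removed.
left, right = "ACGTACGGTC", "TTGACCATGA"
seq_a = left + "G" + right
seq_b = left + right
print(local_alignment_score(seq_a, seq_b))
```

In the toy case, bridging the one-base gap scores 10 − 8 + 10 = 12, which beats either flank alone (10). With these penalties a gap is only bridged when the matching sequence on both sides outweighs its cost, which is consistent with requiring a minimum score across the flanking regions.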
These BUSCO results were similar to those for other high quality lepidopteran genomes (Supplementary Table 8). We assembled 5 of 20 autosomes and the Z chromosome into single scaffolds. We failed to identify a W chromosome, probably because of its highly repetitive nature. See Supplementary Figure 4 for the completeness of the scaffolding in the final H. erato genome assembly.
Annotation of the genome was performed using Just_Annotate_My_Genome (JAMg; https://github.com/genomecuration/JAMg). To facilitate annotation, we used RNA-Seq data generated from different life stages and tissue types (Supplementary Table 9). These data include recent Illumina 2×250 data, 454 data, and archival Illumina 2×50 data. All data were preprocessed using ‘justpreprocessmyreads’ (http://justpreprocessmyreads.sourceforge.net) and were error corrected using Blue 46 with a ‘reference’ kmer dataset derived from the most recently collected 2×250 Illumina RNA-Seq data and a coverage cut-off of 2. The Illumina RNA-Seq data were assembled using Trinity RNA-Seq version 2.1.1 58 with both the ‘de-novo’ and ‘genome-guided’ options. The 454 data alongside all mRNA data acquired from GenBank and public Illumina data acquired from NCBI SRA were assembled and clustered using MIRA 4.9.5 59 . The Trinity de-novo, Trinity genome-guided and the MIRA assemblies were aligned and assembled against the genome using a new version of PASA 60 , thus creating a non-redundant, intron-aware transcript set referred to here as PASA cDNA contigs. The new Illumina RNA-Seq data were aligned against the reference H. erato genome using GSNAP v.2015-09-29 61 , providing high-quality information on intron coordinates. Repetitive content was identified (simple, complex/transposable, de novo, tRNA and rRNA elements) using trf 62 , RepeatModeler 63 , RepeatScout 64 , RepeatMasker 63 , RepBase data 65 , tRNAScan 66 and Aragorn 67 . This masked dataset was provided at the last stage of the pipeline only.
We used two de novo gene modellers, GeneMark-ET 68 and Augustus 3.2.1 69 for gene prediction. Both used the intron co-ordinates as external evidence. In addition, Augustus used further external evidence as hints including the RNA-seq coverage derived from the Illumina reads, protein domains acquired from searching the genome against Swissprot using the HHBlits program 70 , a high-quality subset of the PASA cDNA contigs as determined by JAMg, alignments of Uniref50 and the Heliconius melpomene predicted protein set 71 . The Augustus HMM models were trained and evaluated using ‘training’ and ‘test’ subsets of the high-quality PASA cDNA contigs. Following this, the external evidence was weighted using the JAMg optimization method and the same training and test cDNA contig datasets. At this point, we determined that the repeat masking data provided inferior prediction results and thus they were not used in the final prediction. Finally, Augustus was run with UTR prediction enabled to reduce false positive exons. Resulting UTRs were removed from the final prediction.
The repeat masking information, the GeneMark-ET and Augustus predictions, the PASA cDNA contigs, and the Uniref50 and H. melpomene protein alignments were provided to EvidenceModeler 72 to derive a consensus gene dataset. This consensus dataset was then edited twice with PASA2 in order to add alternative splicing information and the UTRs as supported by cDNA evidence. This formed our Official Gene Set (OGS1). The OGS1 proteins were then functionally annotated using Just_Annotate_My_Proteins (JAMp; https://github.com/genomecuration/JAMp), which searched them against Hidden Markov Profiles of known proteins with manually curated metadata (Swissprot; clustered at 70% identity and aligned). For each significant hit (using the default settings of JAMp, such as an e-value of 1e-10 and p-value of 1e-12), any Gene Ontology, ENZYME and KEGG ontology terms of the known Swissprot proteins were linked to the H. erato predicted proteins, but only if the annotation evidence was experimentally derived and not inferred (i.e. terms with the evidence codes of 'IEA', 'ISS', 'IEP', 'NAS', 'ND', 'NR' were ignored). Finally, the RNA-Seq data were aligned against the OGS1 CDS data and processed with DEW (https://github.com/alpapan/DEW) to infer the expression profiles for each gene. The functional and expression annotations are available from http://annotation.insectacentral.org/heliconius_erato.
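The evidence-code rule described above can be sketched as a simple filter. This is an illustrative reading of that rule, not JAMp code; the function name, the input layout (pairs of term and evidence code) and the placeholder term labels are assumptions.

```python
# Evidence codes that the text says were ignored when transferring terms.
EXCLUDED_EVIDENCE_CODES = frozenset({"IEA", "ISS", "IEP", "NAS", "ND", "NR"})


def transferable_terms(hit_annotations):
    """Keep ontology terms whose evidence code is not in the excluded set.

    hit_annotations: iterable of (term, evidence_code) pairs taken from a
    significant Swissprot hit (hypothetical input layout).
    """
    return [term for term, code in hit_annotations
            if code not in EXCLUDED_EVIDENCE_CODES]


# Placeholder terms, for illustration only.
example = [("term_1", "IDA"), ("term_2", "IEA"), ("term_3", "IMP"), ("term_4", "ND")]
kept = transferable_terms(example)
```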
Sequence alignment and variant calling
We collected and sequenced 101 individual H. erato butterflies from Peru (n = 15), French Guiana (n = 14), Suriname (n = 5), Ecuador (n = 29), Colombia (n = 12), Bolivia (n = 4), Mexico (n = 6) and Panama (n = 16). We collected phenotypically pure (i.e. phenotypes resembling the geographical H. erato races) individuals of each colour pattern race from admixed populations where the ranges of two colour pattern races overlap. Additionally, we collected individuals from 8 different closely related species including H. ricini, H. sara, H. charithonia, H. hecalesia, H. telesiphe, H. hortense, H. clysonymus, and H. hermathena (Fig. 1 and Supplementary Tables 10 and 11).
Whole-genome 100 bp paired-end Illumina resequencing data from these individuals were aligned to the H. erato v1 reference genome using BWA v0.7.13 73 with default parameters. PCR duplicate reads were removed using Picard v1.138 (http://picard.sourceforge.net) and alignments were sorted using SAMtools 74 . Genotypes were called using the Genome Analysis Toolkit (GATK) HaplotypeCaller 75 with default parameters. Individual genomic VCF records (gVCF) were jointly genotyped, grouping individuals according to race and sampling location, using GATK’s GenotypeGVCFs with default parameters except for an expected heterozygosity of 0.025, set to match the populations’ high heterozygosity. Genotype calls were only considered in downstream analysis if they met the following criteria: quality (QUAL) ≥ 30, minimum depth ≥ 10, maximum depth ≤ 100 (to avoid false SNPs due to mapping in repetitive regions), overall depth ≤ 100 × number of samples, strand bias (FS) < 200, quality by depth ≥ 5, and for variant calls, genotype quality (GQ) ≥ 30.
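The filtering criteria listed above can be expressed as a single predicate. The sketch below is one reading of those thresholds, assuming that the minimum and maximum depth apply to each individual genotype call and that the overall depth applies to the site summed across samples; the class and field names are hypothetical and do not correspond to any particular VCF library.

```python
from dataclasses import dataclass


@dataclass
class GenotypeCall:
    """Hypothetical container for the values needed by the filters."""
    qual: float        # site quality (QUAL)
    depth: int         # read depth of this individual's genotype call
    site_depth: int    # read depth summed over all samples at the site
    fs: float          # strand bias (FS)
    qd: float          # quality by depth
    gq: float          # genotype quality (GQ)
    is_variant: bool   # True if the call differs from the reference


def passes_filters(call, n_samples):
    """Return True if a genotype call meets all thresholds given in the text."""
    if call.qual < 30:
        return False
    if not 10 <= call.depth <= 100:
        return False
    if call.site_depth > 100 * n_samples:
        return False
    if call.fs >= 200:
        return False
    if call.qd < 5:
        return False
    # The GQ threshold is only applied to variant calls.
    if call.is_variant and call.gq < 30:
        return False
    return True
```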
Divergence and association analysis
We estimated levels of relative (FST) 76 and absolute genetic divergence (dXY) 77 , and nucleotide diversity (π) 77 between populations in sliding windows using python scripts and egglib 78 . In all our analyses, we only considered windows for which at least 10% of the positions were genotyped for at least 75% of the individuals within each population. For the whole genome analysis of the seven hybrid zones, on average 96.4% (SD = 1.1%) of windows met these criteria. Genotype by phenotype (G × P) associations were tested for each variant position using a two-tailed Fisher’s exact test. Positions were excluded if fewer than 75% of individuals were genotyped for each phenotype. The sliding window approach and the identification of distinct blocks of associated SNPs provide a robust approach for identifying genomic regions of interest in our study system 79 .
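To make the per-position association test concrete, the following minimal sketch computes a two-tailed Fisher’s exact test for a 2 × 2 table using only the Python standard library. The text does not specify how genotypes were tabulated, so the layout used here (allele counts by phenotype) and the function name are assumptions; this is not the authors’ script.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Assumed layout: rows are the two phenotypes, columns are counts of the
    two alleles at one variant position.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Hypergeometric probability of every table with the same margins.
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / denom
             for x in range(lo, hi + 1)}
    p_obs = probs[a]
    # Sum the tables that are no more probable than the observed one;
    # the small tolerance guards against floating-point ties.
    p_value = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-7))
    return min(1.0, p_value)


# A perfectly associated position: 10 copies of one allele in one phenotype
# and 10 copies of the other allele in the other phenotype.
p_perfect = fisher_exact_two_tailed(10, 0, 0, 10)
```

For this perfectly associated table only the two most extreme tables contribute, so the p-value equals 2 divided by the number of ways to choose 10 items from 20.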
We used FastTree v2.1 80 to infer an approximate maximum-likelihood phylogeny from the entire genome using the default parameters. In this analysis, we only used concatenated SNP data from chromosomes 4–9, 11–14, 16, 17 and 20, because these chromosomes did not show any genetic divergence peaks in our population analysis. FastTree computes support values on nodes using the Shimodaira–Hasegawa test. Phylogenetic relationships of individuals across defined colour pattern intervals were constructed using maximum likelihood (ML) trees with RAxML v8.0.26 81 . The best likelihood tree was chosen from 100 trees, each generated from a distinct starting tree, using a GTR model with CAT approximation of rate heterogeneity, and the support values of this tree were inferred with 100 bootstrap replicates.
We applied a phylogenetic strategy for identifying shared or conserved genomic intervals akin to ‘phylogenetic shadowing’ 82 . We evaluated the support for alternative phylogenetic hypotheses in the regions of peaks of divergence around colour pattern loci using a novel method called Twisst (topology weighting by iterative sampling of subtrees; https://github.com/simonhmartin/twisst) 28 . This method solves the problem of describing the relationships between groups that are not necessarily monophyletic. Given a tree and a set of pre-defined groups (in this case races) Twisst determines a weighting for each possible topology describing the relationship of the groups (for example, 6 groups yield 105 possible unrooted topologies and therefore 105 weightings). Topology weightings are determined by sampling a single member of each group and then identifying the topology matched by the resulting subtree. This sampling is iterated over a large number of subtrees and weightings are calculated as the frequency of occurrence of each topology. This method therefore reduces tree complexity caused by imperfect clustering of samples within groups. The ability to consider all possible topologies at each window provides an advantage over more commonly used likelihood ratio tests that only compare two topologies, which is especially relevant for taxa that have potentially many distinct evolutionary histories across their genomes. Weightings were estimated from 500 sampling iterations and averaged over ten bootstrap trees produced by RAxML v8.0.26 81 for each 2 kb window. Averaging weightings over bootstrap trees is expected to reduce false support for certain phylogenetic groupings from trees with low bootstrap support.
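The idea behind topology weighting can be sketched in a few lines of Python. The example below is a simplified illustration restricted to four groups (three possible unrooted topologies), not the Twisst implementation: it represents a tree as nested tuples of sample names, samples one member per group, and records which pairing of groups the sampled subtree supports. All function names and the tree encoding are assumptions made for this sketch.

```python
import random
from collections import Counter


def n_unrooted_topologies(n_groups):
    """Number of unrooted binary topologies for n groups: (2n - 5)!!"""
    count = 1
    for k in range(3, 2 * n_groups - 4, 2):
        count *= k
    return count


def leaf_sets(tree, out):
    """Collect the leaf set below every internal node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def quartet_topology(clades, picked):
    """Return the group pairing (e.g. {A,B} | {C,D}) shown by four samples."""
    four = set(picked)
    for clade in clades:
        inside = four & clade
        if len(inside) == 2:
            side_a = frozenset(picked[s] for s in inside)
            side_b = frozenset(picked[s] for s in four - inside)
            return frozenset([side_a, side_b])
    return None  # the four samples are unresolved in this tree


def topology_weights(tree, groups, n_iter=500, seed=0):
    """Weight each topology by how often sampled subtrees match it."""
    if len(groups) != 4:
        raise ValueError("this sketch handles exactly four groups")
    rng = random.Random(seed)
    clades = []
    leaf_sets(tree, clades)
    counts = Counter()
    for _ in range(n_iter):
        # Sample one member per group and map each sample to its group name.
        picked = {rng.choice(members): name for name, members in groups.items()}
        counts[quartet_topology(clades, picked)] += 1
    return {topo: n / n_iter for topo, n in counts.items()}


# Hypothetical samples: groups A-D, each monophyletic in this tree.
tree = ((("A1", "A2"), ("B1", "B2")), (("C1", "C2"), ("D1", "D2")))
groups = {"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1", "C2"], "D": ["D1", "D2"]}
weights = topology_weights(tree, groups)
```

When every group is monophyletic, as in this toy tree, a single topology receives all of the weight; when members of a group are scattered across the tree, the weight is split among topologies in proportion to how often each is sampled. The helper `n_unrooted_topologies(6)` reproduces the 105 topologies mentioned above for six groups.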
For phylogenetic weighting along the wntA (chromosome 10) and Ro (chromosome 13) intervals, we compared weightings of topologies defined by samples from the following six groups: H. e. demophoon, H. e. etylus, H. e. notabilis, H. e. lativitta/emma, H. e. erato/amalfreda and H. e. hydara (FG). To partly control for the strong phylogeographic signal within H. erato, we focused these analyses on eastern Andean and Amazonian races, which also show the most variation in forewing band shape, size and position. For the optix (chromosome 18) interval, we compared weightings of topologies defined by samples from the following six groups: H. e. amalfreda, H. e. favorinus/hydara (FG), H. e. etylus/lativitta/emma/erato, H. himera, H. telesiphe and H. clysonymus/hortense/hecalesia. To obtain weightings for hypothesized phylogenetic groupings of specific colour pattern forms, we summed the counts of all topologies that were consistent with the hypothesized grouping.
Genotype weighting optix
We evaluated genotypic similarity of species/races to the reference ‘postman’ haplotype using a sliding window analysis. The ‘postman’ haplotype was defined based on the consensus of fixed SNPs between all ‘postman’ (H. e. demophoon, H. e. hydara (Panama), H. e. hydara (French Guiana), H. e. notabilis and H. e. favorinus) and all ‘rayed’ (H. e. erato, H. e. etylus, H. e. emma and H. e. lativitta) H. erato races. In total, there were 264 fixed SNPs across a 69 kb window on chromosome 18 near optix. For each species/race evaluated, the proportion of SNPs that were identical to the postman haplotype was calculated over windows of ten fixed SNPs, with a minimum coverage of 3 SNPs called in all individuals. The window size and minimum coverage were chosen to best capture the turnover of genotypic similarity along the genomic interval.
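The windowed similarity calculation can be sketched as follows. This is an illustrative reading of the procedure rather than the script used in the study: the input is assumed to be one value per fixed SNP, ordered along the chromosome, that is `True` if the evaluated species/race carries the postman allele, `False` if it carries the rayed allele and `None` if the SNP was not called in all individuals. The function name and the `step` parameter are assumptions.

```python
def postman_similarity(matches, window=10, min_called=3, step=1):
    """Proportion of postman-like SNPs in windows of consecutive fixed SNPs.

    Returns a list of (start_index, proportion) pairs; the proportion is
    None when fewer than min_called SNPs in the window were called.
    """
    results = []
    for start in range(0, len(matches) - window + 1, step):
        called = [m for m in matches[start:start + window] if m is not None]
        if len(called) < min_called:
            results.append((start, None))
        else:
            results.append((start, sum(called) / len(called)))
    return results


# Toy input: ten postman-like SNPs followed by ten rayed-like SNPs.
toy = [True] * 10 + [False] * 10
profile = postman_similarity(toy, step=10)
```

With a step of one SNP, the same toy input yields a profile that declines gradually from 1.0 to 0.0 across the transition, which is the kind of turnover the window size was chosen to capture.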
Defining boundaries of colour pattern intervals
Our argument for identifying regulatory modules was hierarchical. The association peaks, or regions of the genome containing clusters of sites perfectly associated with wing pattern phenotype, marked the genomic intervals that probably contained the functional variation responsible for phenotypic differences. We further resolved these intervals combining data across independent transition zones. The rationale is that independent recombination events in the distinct locations break down the pattern of associations, except at those very narrow intervals responsible for pattern differences. Thus, in these areas individuals should group by colour pattern phenotype rather than geographic proximity, which is the pattern evident across the bulk of the genome. This is the basis of the Twisst analyses described above. Specific boundaries are defined by a combination of Twisst and G × P association. For example, near wntA and optix, we defined the boundary positions of the regulatory modules by overlaying the phylogenetic weighting with genotype tables of the fixed allelic differences in the hybrid zone comparisons. More precisely, at the regions where phylogenetic weighting support for phenotypic grouping shifted and increased rapidly, we conservatively identified the boundaries of the intervals by looking for patterns of shared genotypes between samples with similar phenotypes. It should be noted that this approach assumes a single origin for functional alleles that are shared across similar phenotypes and will miss regions where patterning alleles evolved independently. The boundaries of the regulatory modules near Ro and cortex were defined only using the fixed SNP associations because the geographic distribution of the phenotypes does not allow phylogenetic weighting to distinguish between geography and phenotypic grouping for these loci.
Sequencing data was submitted to the Sequence Read Archive (SRA) with BioProject accession PRJNA324415; genome assembly data: SAMN05578372 to SAMN05578377; RNAseq data: SRR616674 to SRR616691, SAMN05578182 to SAMN05578206; linkage map data: SAMN05572290 to SAMN05572390; and re-sequencing data: SAMN05224096 to SAMN05224211.
How to cite this article: Van Belleghem, S. M. et al. Complex modular architecture around a simple toolkit of wing pattern genes. Nat. Ecol. Evol. 1, 0052 (2017).
We thank A. Tapia for maintaining the H. erato genome line and for generating our mapping family, and M. Vargas and C. Rosales for Illumina library preparation. We acknowledge the University of Puerto Rico, the Puerto Rico INBRE grant P20 GM103475 from the National Institute for General Medical Sciences (NIGMS), a component of the National Institutes of Health (NIH); CNRS Nouragues and CEBA awards (B.A.C.); National Science Foundation awards DEB-1257839 (B.A.C.), DEB-1257689 (W.O.M.), DEB-1027019 (W.O.M.); awards 1010094 and 1002410 from the Experimental Program to Stimulate Competitive Research (EPSCoR) program of the National Science Foundation (NSF) for computational resources; and the Smithsonian Institution. This research was supported in part by Lilly Endowment, Inc., through its support for the Indiana University Pervasive Technology Institute, and in part by the Indiana METACyt Initiative. The Indiana METACyt Initiative at IU is also supported in part by Lilly Endowment, Inc.
Supplementary Figures 1–35; Supplementary Tables 1–13