Genetics and evidence for balancing selection of a sex-linked colour polymorphism in a songbird

Colour polymorphisms play a key role in sexual selection and speciation, yet the mechanisms that generate and maintain them are not fully understood. Here, we use genomic and transcriptomic tools to identify the precise genetic architecture and evolutionary history of a sex-linked colour polymorphism in the Gouldian finch Erythrura gouldiae that is also accompanied by remarkable differences in behaviour and physiology. We find that differences in colour are associated with an ~72-kbp region of the Z chromosome in a putative regulatory region for follistatin, an antagonist of the TGF-β superfamily genes. The region is highly differentiated between morphs, unlike the rest of the genome, yet we find no evidence that an inversion is involved in maintaining the distinct haplotypes. Coalescent simulations confirm that there is elevated nucleotide diversity and an excess of intermediate frequency alleles at this locus. We conclude that this pleiotropic colour polymorphism is most probably maintained by balancing selection.

feather development and/or coloration, to a summary of the genes follistatin more directly regulates based on other studies, to analyses of sequence conservation across the Red locus and follistatin coding region. I suspect that the authors are entirely correct in suggesting that the fates of feather precursors are determined earlier in development, but this section of the paper does not hold together particularly well and does relatively little to advance our understanding of how the red versus black alleles generate their myriad phenotypic effects, other than to rule out structural changes in the follistatin gene itself.
The third aim of the paper includes characterizing patterns of genetic diversity, divergence, and phylogenetic history across the Red locus and flanking regions, and then testing through the use of simulations whether entirely neutral processes could have generated the observed data. The authors take great pains to develop an unbiased approach to comparing observed and simulated data (e.g., structuring the simulation to match the empirical sample and assuming, in turn, that either the red allele or black allele is derived) but I still have some concerns about what exactly this analysis reveals. The point above about clearly defining the "Red locus" is potentially important not just for making the data and results easier for readers to understand, but also because it may influence the estimate of recombination rate used in the simulation study. Based on all of the data presented, either recombination is suppressed within "Red locus" (i.e., within the region of elevated LD), or recombinant alleles experience strong negative selection, or both. If the goal is to distinguish between these alternatives, it seems to me that estimating and simulating recombination across the Red locus and flanking regions requires careful consideration of both the empirical data and the simulations and an explicit rationale for defining the limits of the Red locus. Of particular concern is to what extent the empirical estimate of recombination (i.e., a minimum of 58 events in the sample) reflects recombination events that occurred in the "flanking regions" and/or recombination events that involved two alleles from the same lineage, i.e., two black alleles or two red alleles, respectively. One might ask: 1) is recombination generally reduced in the Red locus?; or 2) is it reduced only in red/black heterozygotes?; or 3) is there no reduction in physical recombination rate but instead the pattern of LD is generated by selection against recombinants"; or 4) some combination of the above? As I understand it, the simulation assumes both neutrality and a particular recombination rate that applies to the entire Red locus and flanking regions. The data clearly do not fit this model. While neutrality seems unlikely given the biology of the system, it's not clear to me that rejection of the null leads to a strong inference of balancing selection rather than a more general inference that the history and processes generating the empirical data are more complicated than the null model. What is the direct/positive evidence of balancing selection? Is it possible to test the different hypotheses about recombination noted above using multiple simulations with different assumptions?
Finally, I remain puzzled about the overall explanation for the origin, dynamics and evolutionary maintenance of color polymorphism in Gouldian finches, though it may not be fair to hold this paper and this set of authors entirely responsible for that. For example, Pryke & Griffith (2007, J. Evol. Biol.) and Pryke (2010) seem to tell somewhat different stories about mate preferences (do all females prefer red or is a preference for an individual's "own morph" genetically determined?), whereas Bolton et al. 2017 raise significant questions about whether wild birds are affected by the huge differences in survival and apparent genetic incompatibilities observed in captive birds. Within the context of the present paper and looking back at Pryke (2010), a clear hypothesis about which combinations of alleles at which loci are incompatible seems to be lacking. The present paper shows little or no genome-wide divergence between morphs. If so, then there would seem to be little or no opportunity for incompatibility between the Z-linked Red locus and autosomal loci; or at least, little or no opportunity for avoiding such incompatibilities through mate choice/assortative mating -i.e., if head color does not predict genotypes at other loci. Incompatibility might occur between the red and black alleles but presumably that would only be expressed in heterozygous males, whereas Pryke (2010) showed that females from mixed pairs suffered the lowest fitness, consistent with Haldane's rule. I note that Pryke's (2010) analysis focused on offspring survival as a function of parental genotypes (and whether they matched or not) rather than offspring genotypes per se, making it somewhat difficult to interpret the results of that study. In the present study and given a lack of evidence for an inversion, the authors speculate that there is selection against recombinant alleles, implying that multiple elements within the red and black alleles, respectively, are co-adapted and must be maintained together in the same haplotype to avoid reduced fitness. It's not clear to me how that hypothesis would be reconciled with Pryke's results, but more generally, specific hypotheses about the nature of potential genetic incompatibilities in this system seem to be lacking.
All of this is potentially relevant to the question of what exactly can be inferred from the rejection of the null neutral model used in the authors' simulations. The present paper concludes that some form of balancing selection is in effect and discusses several alternative forms, but it seems possible that rejection of the null model might point to other differences between the simulated data and the empirical data (e.g., some mechanism other than an inversion that limits recombination).
I am returning the manuscript file with numerous questions, comments, and suggested edits embedded using track changes. Below, I will highlight several specific points that I think are important to address and will then provide some specific comments on the supplementary information document.
Comments keyed to line numbers: 99 and elsewhere: "core region" -Given the steep transition in Fst values and LD, is there really any useful distinction between "the region" and "the core region"? Moving through the rest of the manuscript, I found the term "core region" to be not particularly helpful and potentially confusing. If a larger region was identified in previous studies or with the RAD-seq data, why not go ahead and use the refined estimate provided here and consider the region of strongly elevated LD as "the Red locus" and the regions to the immediate left and right as flanking regions? 126-135: it's not entirely clear how this discussion of follistatin and the genes it regulates connects to the specific genes/pathways identified in the previous paragraph. 205-206: Isn't this result terribly problematic to the hypothesis that genetic incompatibilities are responsible for lower survival in the offspring of mixed-morph and mixed genotype pairs? There could be incompatibility between the Zr and ZR alleles at the red locus, but this should affect only the survival of heterozygous males. Why exactly do the offspring of mixed morph pairs suffer lower survival as a function of their parents' genotypes rather than as a function of their own genotypes? And, if the mophs are not differentiated genome-wide, what generates a higher likelihood of interlocus incompatibility in the female offspring of mixed morph pairs versus same morph pairs? 231: I am curious to know; 1) if recombination within the red locus is still detected if the red locus is defined as coinciding with the region of nearly maximal Fst and elevated LD, and 2) if the analysis can distinguish between recombination between alleles within one of the two divergent lineages versus recombination between divergent alleles. Put another way, how did recombination rate vary across the red locus? Is it just at the ends, beyond the "core"? And, does the analysis distinguish within lineage recombination from between lineage recombination? The strong LD at the red locus would seem to suggest that there is more than one selected/functional site across the locus. Otherwise, it's not clear how these patterns of divergence could be maintained without an inversion. 331: Herein lies the puzzle: what is the nature of these genetic incompatibilities assuming there is no population-level divergence between morphs? How does breeding between morphs generate incompatibilities if no other loci in the genome are associated with the red locus? 579: This seems problematic. Given recombination and given that the samples were allocated to groups based on genotype at the Red locus, a comparison between the two groups at the reference loci is not necessarily going to measure the same thing (e.g., maximal historical divergence for the Red locus but not for the reference loci). To make the comparison more comparable, why not compare the two maximally divergent alleles at each reference locus?
Supplementary doc: Line 86: gBGC -it's not clear how or if this is used in the following simulations/analyses. Lines 145-146: seems likely that there would be at least some recurrent mutation over such a large region.
Lines 137-165: I have two concerns/questions here. First, for the empirical data, is it possible to distinguish between (and/or estimate the number of) recombination events that occurred between two red alleles or between two black alleles (i.e., how many within allelic lineage recombination events versus between allelic lineage recombination events)? Second, this section does not clearly define the Red locus, but the next section mentions simulating data for "a 99,669 bp region (equivalent in length to the Red locus)." If recombination in the empirical data was estimated for a comparable region, which includes "flanking regions" of lower Fst and LD, how many of the minimum of 58 recombination events occurred in the "flanking regions" rather than in the ~72 kbp region of elevated Fst and LD? Based on all of the data presented, either recombination is suppressed in the "Red locus" (i.e., within the region of elevated LD), or recombinant alleles experience strong negative selection, or both. Is the goal to distinguish between these alternatives? It seems to me that estimating and simulating recombination across the Red locus and flanking regions requires careful consideration of both the empirical data and the simulations and an explicit rationale for defining the limits of the Red locus.
170: what defines the limits of the "locus" -why not define the locus as the core region? Why not say, "we simulated a 99,669 bp region representing the red locus (xx,xxx bp) plus flanking regions"? 286: see questions above.
Suppl. Fig. 1: does this figure plot RADseq or WGS results? Based on the methods section, WGS data were collected for a sample of 24 captive birds, of which 17 were analyzed. If this is WGS data (and based on all the other results presented), there should be ~14 consecutive 5kb windows on the Zchromosome with Fst close to 1. I don't see that in the figure. If these are RADseq results, where are the WGS results presented? Suppl. Fig. 2 Suppl. Fig. 5: What does the top panel labeled "OtherRefSeq" illustrate? Rather than "pre-calculated," which is vague, how about "PhastCon values obtained from…" I think the red box should highlight the ~72 kbp region of clearly elevated Fst and LD and that this region should be defined as the "Red locus" (see other comments).
Suppl. Fig. 9: descriptions for e, f, g, and h in legend are not labeled correctly Suppl. Fig. 15: is this tree based on complete coverage of the region from MiSeq data?
Suppl. Table 7: it would be helpful to highlight the two (one on each end?) fragments that span the transitions on left and right from high Fst and high LD to much lower values of each. It seems to me that it is only those two fragments that really matter in terms of refuting the hypothesis of an inversion.
Suppl. Table 9: what's the rationale for defining the Red locus as encompassing the entire ~100 kbp region? See comments above.
Reviewer #2: Remarks to the Author: There continues to be uncertainty and controversy regarding the relative importance of multilocus supergenes (typically associated with genomic regions of low recombination such as inversions) versus pleiotropic single locus "switches" in the evolution of color polymorphisms associated with correlated phenotypic traits. Lately, some supergene examples have been characterized and received a good deal of attention, notably in birds. In this MS Kim and colleagues present evidence for a roughly 72kb intergenic sequence, that likely plays a role in the regulation of the follistatin gene, as being the only consistent difference genomic between the red and black morphs of Australia's Gouldian finch, a species in which the morphs differ in morphology, behavior, and physiology as well as head color. This species and its polymorphism are famous and visually striking, but a bit enigmatic in that some of the results for more or less domesticated strains and laboratory populations seem not to hold up well in studies of wild birds. The authors additionally find no evidence of an inversion being associated with the region of high differentiation and the evidence against an inversion seems to me persuasive.
I can see no obvious or severe problems with the claims presented by the authors, and the MS is generally well written with few errors. One point that seems to me unclear is why balancing selection and an inversion are treated as strict alternatives. If a population is polymorphic for the presence or absence of an inversion, one must still explain the persistence of that polymorphism, and balancing selection could be the answer. Similarly, inversions may not be a perfect barrier to gene flow between homologous inverted and non-inverted regions of chromosomes. What they are really seeking to explain is strong linkage disequilibrium along a substantial length of chromosome, with the alternative forms associated with different complex phenotypes.
The main criticism one can make is that the story is somewhat incomplete. The authors were unable to characterize transcriptomic differences between the morphs that relate directly to the putative switch locus, thus the mode of action of the apparent switch has not been elucidated, and they have not attempted a manipulation of the locus in question, so its role is not fully established. The main question for the Editor then is whether the story as presented-which is clearly important and interesting-is of sufficiently broad interest and impact as to justify publication in Nature Communications.
Specific Comments: 39: comma after common 39: claims to be models or emerging models are regrettably common and often inadequately substantiated; better to make a more specific claim for importance or novelty 46: delete "to" 84: consistent seems strong with p=.11 … but not significantly different from expectation is reasonable 188: good to provide a citation here 254: better to explain a little more clearly how this scenario leads to frequency-dependence 263: run on sentence 267: sentence runs on some 280: family-wise? 293: a little more explanation of how to interpret c would be helpful 347: nice if this point could be fit into the main text, though clearly space is an issue Suppl 301: not Kim at al Suppl 448: why non-alphabetical legend organization? Correct? Suppl 458: 10a? Suppl 506: could mention lack of field evidence for assortative mating here Reviewer #3: Remarks to the Author: This is an impressive and well carried out study examining the genetic basis of a color polymorphism in Gouldian finches using a variety of approaches. The authors demonstrate that the divergence between the red and the black morph reside to a 72 kb region on the Z chromosome, consistent with earlier findings of this polymorphism being located in this region on Z. There are several interesting findings: firstly, given the many differences in a plethora of traits between the morphs, that this is not due to an inversion as seen in many other species but rather (most likely) due to pleiotropic effect. Second, that it locates to a intergenic region suggesting a regulatory role (also well worth keeping in mind that 60-70% of all GWAS hits are intergenic) rather than a coding sequence change. Third, using coalcent simulations, the authors demonstrate that this polymorphism is likely maintained by balancing selection and arose in the Gouldian finch lineage.
Overall, this ms is very well written and carried out, although I note that I do not have expertise in all the analyses performed by the authors. It provides a nice contrast to the recent findings of the involvment of inversions ('supergenes') causing polymorphisms in several bird species (ruffs, zebra finches etc).
From my reading the mapping and identification of the region seems very solid and it is very interesting that the same gene seem to be involved in color divergence in a different species pair (golden and blue winged warblers). Given the finding of a TE within the Red locus I did wonder if this could be linked to the polymorphism as well, even if it arose before divergence between the red and black morph it could facilitate divergence. This is interesting since TE is linked to color patterning and polymorphism in both Heliconius and Biston bettularia. It is also not clear how This is even more so given that the analysis to demonstrate balancing selection as the mechanism for the maintenance of this polymorphism require strong assortative mating, somethign which is not observed in natural populations. Should one not also expect accummulation of deleterious mutations within the region since recombination is essentially not taking place? This is something that you could test with the resequencing data?

Reviewer #1 (Remarks to the Author):
This is an interesting study on a fascinating bird species with a color polymorphism that is also associated with a suite of behavioral and physiological traits. The authors present a somewhat complicated array of genetic data sets that allow for a good characterization of the genomic structure of the "Red locus," which was previously mapped to a relatively narrow region on the Z-chromosome. They also present an analysis of gene expression in developing black versus red head feathers (albeit based on small sample size), and conduct a simulation study to test whether the observed data could have been generated by neutral processes.
We thank the reviewer for their generous and enthusiastic comments. The suggestions on how to improve the manuscript are helpful, and we have taken these on board. There were a substantial number of comments from Reviewer #1, so we have added a Table below to answer them as fully as possible.
On the first point of characterizing patterns of genetic diversity and divergence at the Red locus, the data are robust and yield an interesting result: the 'red' and 'black' alleles map to a ~72 kbp region of strongly elevated divergence and linkage disequilibrium with a relatively sharp transition to baseline levels on both ends. This pattern is what one might expect from an inversion, but long range PCR products spanning the edges of the ~72 kbp region reveal no evidence of an inversion. I highlight a few concerns about this aspect of the study below: 1) Difficulty keeping track of all the different data sets. I gather that this study represents a collaboration that developed from what were originally two or perhaps three independent research efforts. Whether that is an accurate inference or not, it is often difficult to follow which data set is being used for a given result/analysis. Likewise, it is difficult to keep track of the level of genomic coverage and quality (e.g., Sanger versus next gen and at what sequencing depth) that each data set provides. It also seems that results for some data sets are never presented (e.g., the WGS data and results for the M2 and M3 markers).
Partial solutions to this problem might include: 1) a supplementary table that summarizes the different data sets and includes basic information about population sources, sample sizes, as well as number of loci, overall length of the sequences included, and/or genomic coverage and sequencing depth as appropriate, and 2) identification in each figure legend of the data set used, both in the main paper and the supplementary information doc. 2) Definition of the "Red locus" versus "core of the Red locus." It was not clear to me that "Red locus" was consistently defined throughout the paper and in all of the analyses presented. More importantly, I did not see any value in distinguishing between the locus and the core of the locus. I would strongly recommend that the "Red locus" be defined as being equivalent to the region of strongly elevated Fst and LD (the limits of which can be based on some objective criterion/criteria) and that adjacent sequences upstream and downstream of that region be referred to as "flanking regions." I suspect that the larger region included as part of the "Red locus" reflects either a previous, approximate estimate of its location based on the earlier mapping study, or operational decisions about regions to sequence and include in various analyses. What biological reality argues for including as part of the Red locus anything outside of the region of elevated Fst and LD?
The definition of a 'locus' is often arbitrary, particularly when the causal site is not clearly identified. In a previous linkage mapping study (Kim et al 2016), the Red locus was mapped within a 7.2-cM (~8 Mb) interval, but the current study has refined it to an ~100kbp region including the ~72 kbp highly differentiated core regions.
In fact, we initially considered defining the Red locus as lying within the narrower interval, but it is difficult to clearly define the boundary between the core region and the remainder of the 100kbp region without identifying breakpoints ( However, we do agree that the reviewer's concern regarding the potential differences in recombination rate between the core and the flanking regions is relevant, so we now provide further information about the test we conducted before concluding that using the current definition in the subsequent sections is appropriate (see the response to the comments on lines 137-165, 145-146, and 231 of the original manuscript, below).
We agree that the extra "flanking region" upstream of (and not contiguous with) the Red locus does not add much informationit was not used in the simulations. Therefore, we have removed reference to this locus from the main text and Tables (Supplementary Table 4 409-416 (2016).
3) The 9-locus microsatellite data set is not particularly impressive in comparison to the other data sets presented and provides relatively little power for detecting subtle patterns of population structure, if they exist. While the sampling for the RADSeq data is rather odd (i.e., 22 black males versus 8 red females and two red males), I would be interested to see what those data indicate about genome-wide or autosomal divergence between the color morphs (e.g., using the fineRADstructure software). This is a key question because previous papers on this system (e.g., Pryke & Griffith 2009 Evolution) seemed to assume some level of population differentiation between the morphs that would provide the basis/opportunity for genetic incompatibilities among loci (see below).
We used microsatellite data as this was an efficient way to test for population structure between the two morphs, given the reasonably large sample size. The initial sampling scheme for the RADSeq analysis (10 black males, 8 red females, and 2 red males) aimed to locate the Red locus more precisely, but again efficiently, given a restricted budget, rather than to make more detailed population genetic inference. At the Red locus, black males should be homozygous for the recessive (black) allele and red females should be hemizygous for the dominant allele. We did not know the genotypes of the two red males, which could be either heterozygous or homozygous for red. By treating red females as homozygous we were able to apply the dominance model in the GWAS analysis using plink, and so successfully detect the location of the Red locus. Twelve additional black male samples were available from other experiments and were subsequently included to increase the power further.
Although the Z chromosome sampling was not balanced for the population genetic inference, as the reviewer suggests we can indeed assume the autosomal chromosomes were randomly chosen with respect to Red, so that they can be used in the population genetic analysis.
To confirm our STRUCTURE analysis, we have now performed the suggested fineRADstructure analysis using the autosomal RAD loci ( Supplementary Fig. 6). Except for clusters caused by batch effects (amount of missing data per library), there was no population structure with regard to the head colour, consistent with the STRUCTURE analysis. As we further discuss below, and as the reviewer points out, testing for the genetic differentiation between morphs was an important indirect test for incompatibilities. Our results show that there is no widespread genomic differentiation between morphs.
As noted, the gene expression data are based on small sample size, which prevents any statistically robust inferences about differences between the morphs. Moreover, the results section on gene expression transitions from identification of putative DE genes and pathways that might be involved in feather development and/or coloration, to a summary of the genes follistatin more directly regulates based on other studies, to analyses of sequence conservation across the Red locus and follistatin coding region. I suspect that the authors are entirely correct in suggesting that the fates of feather precursors are determined earlier in development, but this section of the paper does not hold together particularly well and does relatively little to advance our understanding of how the red versus black alleles generate their myriad phenotypic effects, other than to rule out structural changes in the follistatin gene itself.
We agree that the gene expression data are of low power due to the small sample size but we provide the result as a suggestive list for future studies. However, it is interesting that multiple differentially expressed genes belong to a functionally correlated Gene Ontology network that has the potential to control feather development and pigmentation. We have now moved the description of the annotation network to the caption of Supplementary Figure 3, so reducing this aspect in the main text, and rephrased the main text to highlight the upstream position of FST in the regulatory network.
The third aim of the paper includes characterizing patterns of genetic diversity, divergence, and phylogenetic history across the Red locus and flanking regions, and then testing through the use of simulations whether entirely neutral processes could have generated the observed data. The authors take great pains to develop an unbiased approach to comparing observed and simulated data (e.g., structuring the simulation to match the empirical sample and assuming, in turn, that either the red allele or black allele is derived) but I still have some concerns about what exactly this analysis reveals.
The point above about clearly defining the "Red locus" is potentially important not just for making the data and results easier for readers to understand, but also because it may influence the estimate of recombination rate used in the simulation study. Based on all of the data presented, either recombination is suppressed within "Red locus" (i.e., within the region of elevated LD), or recombinant alleles experience strong negative selection, or both. If the goal is to distinguish between these alternatives, it seems to me that estimating and simulating recombination across the Red locus and flanking regions requires careful consideration of both the empirical data and the simulations and an explicit rationale for defining the limits of the Red locus.
Of particular concern is to what extent the empirical estimate of recombination (i.e., a minimum of 58 events in the sample) reflects recombination events that occurred in the "flanking regions" and/or recombination events that involved two alleles from the same lineage, i.e., two black alleles or two red alleles, respectively. One might ask: 1) is recombination generally reduced in the Red locus?; or 2) is it reduced only in red/black heterozygotes?; or 3) is there no reduction in physical recombination rate but instead the pattern of LD is generated by selection against recombinants"; or 4) some combination of the above?
Regarding the various comments on the recombination rate, see the responses to the comments on lines 137-165, 145-146, and 231, below. As I understand it, the simulation assumes both neutrality and a particular recombination rate that applies to the entire Red locus and flanking regions. The data clearly do not fit this model. While neutrality seems unlikely given the biology of the system, it's not clear to me that rejection of the null leads to a strong inference of balancing selection rather than a more general inference that the history and processes generating the empirical data are more complicated than the null model. What is the direct/positive evidence of balancing selection? Is it possible to test the different hypotheses about recombination noted above using multiple simulations with different assumptions?
Unfortunately, we cannot use the procedure we employed to simulate long-term balancing selection, because the methods that we used (SelSim, mbs) break down in this sort of scenario.
However, rejecting the neutral model is the starting point in any attempt to look for evidence for selection (Casillas and Barbadilla, 2017), and is the null model in widely used methods such as Tajima's D. Since we observe a large excess of pi, a highly positive Tajima's D, and unusually high divergence between alleles, over and above the effects of the sampling scheme, these are "positive evidence of balancing selection".

We are a little confused by the statement "the data clearly do not fit [the neutral] model", because the situation is complicated by the non-random sampling scheme at the
We also do not think that it is possible to test the different hypotheses about recombination that the reviewer suggests: first, we think that they are mostly a function of a local reduction in the recombination rate, but second, it is not possible to incorporate different rates of recombination for the different classes using the coalescent simulation machinery that we employ. incompatible seems to be lacking. The present paper shows little or no genome-wide divergence between morphs. If so, then there would seem to be little or no opportunity for incompatibility between the Zlinked Red locus and autosomal loci; or at least, little or no opportunity for avoiding such incompatibilities through mate choice/assortative matingi.e., if head color does not predict genotypes at other loci. Incompatibility might occur between the red and black alleles but presumably that would only be expressed in heterozygous males, whereas Pryke (2010) showed that females from mixed pairs suffered the lowest fitness, consistent with Haldane's rule. I note that Pryke's (2010) analysis focused on offspring survival as a function of parental genotypes (and whether they matched or not) rather than offspring genotypes per se, making it somewhat difficult to interpret the results of that study. In the present study and given a lack of evidence for an inversion, the authors speculate that there is selection against recombinant alleles, implying that multiple elements within the red and black alleles, respectively, are co-adapted and must be maintained together in the same haplotype to avoid reduced fitness. It's not clear to me how that hypothesis would be reconciled with Pryke's results, but more generally, specific hypotheses about the nature of potential genetic incompatibilities in this system seem to be lacking.

Although the genetic basis of the incompatibilities observed in the captive population is
beyond the scope of our study, one of the goals in this paper was to indirectly test if the incompatibilities described in captive populations are occurring in the wild population by examining the pattern of genomic differentiation. Clearly, our results including the test for HWE (Supplementary Table 3 We can only speculate that artificial selection and/or inbreeding in the captive population might have led to the discrepancy between the captive population studied by Pryke et al. and the wild population. First, it is noteworthy that LD in the captive population we studied here has significantly increased compared to that in the wild population. In our previous linkage mapping study (Kim et al, 2016)from the captive population that was also the source of the birds used for the studies of incompatibility (Pryke et al. 2009 793-798 (2009).
All of this is potentially relevant to the question of what exactly can be inferred from the rejection of the null neutral model used in the authors' simulations. The present paper concludes that some form of balancing selection is in effect and discusses several alternative forms, but it seems possible that rejection of the null model might point to other differences between the simulated data and the empirical data (e.g., some mechanism other than an inversion that limits recombination).
I am returning the manuscript file with numerous questions, comments, and suggested edits embedded using track changes. Below, I will highlight several specific points that I think are important to address and will then provide some specific comments on the supplementary information document.
Comments keyed to line numbers: * The line numbers used by Reviewer #1 differed from our original manuscript (NCOMMS-18-

10766-T -18034016).
We have inferred from the comments what was being referenced and have used the reviewer's line numbers as references in the response. 99 and elsewhere: "core region" -Given the steep transition in Fst values and LD, is there really any useful distinction between "the region" and "the core region"? Moving through the rest of the manuscript, I found the term "core region" to be not particularly helpful and potentially confusing. If a larger region was identified in previous studies or with the RAD-seq data, why not go ahead and use the refined estimate provided here and consider the region of strongly elevated LD as "the Red locus" and the regions to the immediate left and right as flanking regions.

See above response.
126-135: it's not entirely clear how this discussion of follistatin and the genes it regulates connects to the specific genes/pathways identified in the previous paragraph.
We have now moved an entire section about the transcriptomic data to the caption of Supplementary Fig. 3, and briefly mention that there were some DE genes known to interact with FST. We have rephrased the test to state that multiple functional differences may require expression differences for genes that act upstream of the regulatory network in an earlier developmental stage. 205-206: Isn't this result terribly problematic to the hypothesis that genetic incompatibilities are responsible for lower survival in the offspring of mixed-morph and mixed genotype pairs? There could be incompatibility between the Zr and ZR alleles at the red locus, but this should affect only the survival of heterozygous males. Why exactly do the offspring of mixed morph pairs suffer lower survival as a function of their parents' genotypes rather than as a function of their own genotypes? And, if the morphs are not differentiated genome-wide, what generates a higher likelihood of interlocus incompatibility in the female offspring of mixed morph pairs versus same morph pairs?
We agree with the reviewer that these results are somewhat unexpected, given the reports of red/black incompatibilities in previous studies. In this study, we made no assumptions about incompatibilities between morphs in the wild but tested if the incompatibilities observed in captivity had led to genome-wide differentiation between morphs in the wild. Therefore, we do not think the result is itself problematic. However, our results, along with Bolton et al. (2017) suggest a new hypothesis: that the discrepancies are due to differences between the wild and captive populations, as discussed above. We now address this point explicitly in the discussion. * Bolton, P. E. et al. The colour of paternity: extra-pair paternity in the wild Gouldian finch does not appear to be driven by genetic incompatibility between morphs. J. Evolution. Biol. 30, 174-190 (2017). 231: I am curious to know; 1) if recombination within the red locus is still detected if the red locus is defined as coinciding with the region of nearly maximal Fst and elevated LD, and 2) if the analysis can distinguish between recombination between alleles within one of the two divergent lineages versus recombination between divergent alleles. Put another way, how did recombination rate vary across the red locus? Is it just at the ends, beyond the "core"? And, does the analysis distinguish within lineage recombination from between lineage recombination? The strong LD at the red locus would seem to suggest that there is more than one selected/functional site across the locus. Otherwise, it's not clear how these patterns of divergence could be maintained without an inversion.  LD (41 / 69,[1][2][3][4], compared to the whole region (58 / 99,699 = 5.82e-4). We recognise that this is a legitimate concern and present these results in Supplementary 579: This seems problematic. Given recombination and given that the samples were allocated to groups based on genotype at the Red locus, a comparison between the two groups at the reference loci is not necessarily going to measure the same thing (e.g., maximal historical divergence for the Red locus but not for the reference loci). To make the comparison more comparable, why not compare the two maximally divergent alleles at each reference locus?
We think that the test suggested by the reviewer is likely to be extremely conservative, because it is inconsistent with how our data were generated (our sampling was based on the alleles' association with morphs only, without any regard to divergence or any sequence-based statistics).  Figure 3 in the main text, but with rho = 0.0002 as opposed to rho = 0.004, which is a reduction of more than an order of magnitude. A reduction compared to the estimated value of rho is what we expect if some of the recombination events we infer using R M were due to recurrent mutation. As an aside, 0.0002 is approximately the value of rho inferred by LDhat, both within the red and within the black allelic classes at the Red locus. LDhat allows recurrent mutation. Please bear in mind, though, that the strong LD between the allelic classes prevented us from using LDhat to assay recombination in the combined sample at the Red locus, as we originally stated in the supplementary information.
As is apparent from the figures below, the observed polymorphism at the Red locus lies outside the 95% CIs from the simulations, even with this reduced recombination rate. This is true for a derived allele frequency of 0.144 (panels A-D) and a DAF of 0.856 (panels E-H).
Lines 137-165: I have two concerns/questions here. First, for the empirical data, is it possible to distinguish between (and/or estimate the number of) recombination events that occurred between two red alleles or between two black alleles (i.e., how many within allelic lineage recombination events versus between allelic lineage recombination events)?  Second, this section does not clearly define the Red locus, but the next section mentions simulating data for "a 99,669 bp region (equivalent in length to the Red locus)." If recombination in the empirical data was estimated for a comparable region, which includes "flanking regions" of lower Fst and LD, how many of the minimum of 58 recombination events occurred in the "flanking regions" rather than in the ~72 kbp region of elevated Fst and LD? Based on all of the data presented, either recombination is suppressed in the "Red locus" (i.e., within the region of elevated LD), or recombinant alleles experience strong negative selection, or both. Is the goal to distinguish between these alternatives? It seems to me that estimating and simulating recombination across the Red locus and flanking regions requires careful consideration of both the empirical data and the simulations and an explicit rationale for defining the limits of the Red locus.
Regarding recombination across different parts of the region under consideration, see the response to the comment above as well as the response to Referee 1's the comment on line 231 of the original manuscript. These suggest that our treatment of recombination at the Red locus for the purposes of testing our null model was conservative.
We do not claim to be able to distinguish between alternative explanations for an effective or real suppression of recombination across this region -and we do not think that this is possible with population genetic data alone. Our goal is to incorporate recombination into our analyses to a best approximation in an effort to test our null hypothesis of neutral evolution, not to conduct model choice between the two alternatives that the reviewer presents. We would also like to reiterate the robustness of our results to a recombination rate even an order of magnitude below the one that we infer using RM -see our response to Referee 1's comment on lines 145-156, above. 170: what defines the limits of the "locus" -why not define the locus as the core region? Why not say, This was a result from WGS using a sliding-window approach. Unfortunately, the nearest available windows flanking the single window (5kbp from 46,580,001 bp) on the Red locus were located at 310 kbp and 410 kbp away from the focal window, respectively. Supplementary Fig. 1. The sources of the data are provided in the figure caption. After filtering, there were only two SNPs within the core of the Red locus for each (RADSeq and WGS) dataset, so the pattern discovered in Fig. 2 could not be obtained using these data.

We now present site-by-site F ST from both RADSeq (wild population) and WGS (captive population), and the frequency distribution of F ST values in
Suppl. Fig. 2: does this fig show all RAD loci in the ~5Mbp region covered or only selected loci/SNPs? If the latter, how were they selected? What do the results for M2 and M3 look like? Are M2 and M3 also perfectly (or nearly perfectly) associated with phenotype?

This figure shows all RAD loci in the ~5 Mbp interval around the Red locus. Due to the density of the RAD loci there is only one RADSeq locus (with two SNPs) that overlaps with the Red locus.
We have now included the genotype results for M1~M3 in Supplementary Fig. 2

. Two recombination events between black-and red-linked alleles are visible in the genotypes of red males. A single black female that has red-linked alleles for all three markers is likely to represent a phenotyping error due to the less bright red head colour of females in the wild population.
Suppl. Fig. 4: does this gene have a single exon?

More details have been added to the caption of Supplementary Figure 4.
Suppl. Fig. 5: What does the top panel labeled "OtherRefSeq" illustrate? Rather than "pre-calculated," which is vague, how about "PhastCon values obtained from…" I think the red box should highlight the ~72 kbp region of clearly elevated Fst and LD and that this region should be defined as the "Red locus" (see other comments).

We have now added descriptions for the top panel and the source of PhastCon values, as
suggested. We have removed the panel "OtherRefSeq", which shows gene models for other species to reduce confusion. We have now added the position of the 'core' of the Red locus in the top panel.
These are now correctly labelled Suppl. Fig. 15: is this tree based on complete coverage of the region from MiSeq data? This is based on the entire core region from MiSeq data.
Suppl. Table 7: it would be helpful to highlight the two (one on each end?) fragments that span the transitions on left and right from high Fst and high LD to much lower values of each. It seems to me that it is only those two fragments that really matter in terms of refuting the hypothesis of an inversion.

Approximate position of the 'core' of the Red locus is now shaded.
Suppl. Table 9: what's the rationale for defining the Red locus as encompassing the entire ~100 kbp region? See comments above.
See above response for the definition of the Red locus. In captive populations, prezygotic incompatibility was observed in the form of assortative mating. Currently, there is no strong evidence of assortative mating in the wild population. We have made this clear by adding "but these were not fully tested in the wild population".
3 79 near-perfect Comment [A3]: A single "CA" female scored as black? Is that the only exception? If so, might be good to say something like "with one exception" Accepted. Added "with one exception in a female".

80
Supplementary There is no strong evidence of assortative mating in the wild population and our results do not support it. This is now clarified in the introduction and also mentioned in the discussion. the "core" region Comment [A8]: Given the steep transition in Fst values and LD, is there really any useful distinction between the "region" and the "core region"? Moving through the rest of the manuscript, I found the "core region" distinction to be not particularly helpful and potentially confusing. If a larger region was delineated in previous studies, why not go ahead and use the refined estimate provided here and consider the region of strongly elevated LD as "the region" and sequence to the immediate left and right as flanking regions?
Throughout the manuscript the "Red locus" is the full ~100 kbp sequenced region using MiSeq that contains the ~72 kbp "core region" that shows high divergence. Because we did not find any breakpoints or causal sites that distinguish the core and the regions adjacent to the stretch of high LD, we have left the definition of Red as the entire stretch of contiguous sequence obtained from the MiSeq. Because we did not find differences in recombination rates between this entire region and the 'core', we use parameters estimated using the entire region in the simulation. We believe this is a more conservative approach than using only the highly divergent region. See the main responses.
9 99 using long-range PCR Comment [A9]: This long range PCR effort is impressive but it does not extend very far on either end of the region of elevated divergence, and the primer pair on the right end is considered part of the "locus" and is only a couple thousand base pairs beyond the "core" of the locusassuming that the pattern of LD was know first, it seems like a better design would have been to focus on the ends rather than the middle. In any case, redefining "the region" or "the locus" as suggested in the previous comment may help to show that the long range PCR products effectively covered the ends of the locus. Answered in main response. In short, we did not make the assumption that there should be incompatibilities in the wild population. We examined if incompatibilities described in the captive population are occurring in the wild population and found no support for the incompatibilities in the wild population. We think the discrepancy is caused by the differences between wild and captive population (see main response).

31 207
recombination within the Red locus Comment [A31]: I am curious to know; 1) if recombination within the red locus is still detected if the red locus is defined as coinciding with the region of nearly maximal Fst and elevated LD, and 2) if the analysis can distinguish between recombination between alleles within one of the two divergent lineages versus recombination between divergent alleles. Put another way, how did recombination rate vary across the red locus? Is it just at the ends, beyond the "core"? And, does the analysis distinguish within lineage recombination from between lineage recombination? The strong LD at the red locus would seem to suggest that there is more than one selected/functional site across the locus. Otherwise, it's not clear how these patterns of divergence could be maintained without an inversion.
We added new Supplementary Table 13 showing the recombination in the region encompassing the Red locus and compare the entire Red locus and the core region. There are negligible differences between per base recombination rate (In fact, the rate in the core region is slightly higher). Because we don't know the exact location of the causal site(s) and the proportion of the Red locus that is outside the region of elevated LD is low, we define the entire sequenced region as the Red locus. We also consider the multiple selected sites as a parsimonious explanation but we have chosen not to emphasise this because we don't yet know the causal site(s).

209 selection against recombinants
Comment [A32]: This also seems to imply that there are multiple functionally relevant sites across the red locus and that recombination of these sites generates incompatibilities.
We do consider multiple functional sites as a parsimonious explanation for the observed pattern but without knowing the exact sites we don't make a significant claim here.

211
the core of the Red locus Comment [A33]: Seems like a different definition of "core" here. Again, it would be better to simplify by equating the red locus to the "core region" and referring to adjacent sequences as flanking regions.
Throughout the manuscript the Red locus is the full 100 kbp sequenced region that contains the ~72 kbp core region with high divergence. The "core" in the respective sentence has the same meaning. These are twenty-six short fragments (~1 kbp each) across the Red locus. The distribution of all SNPs found in these fragments is shown in the middle panel of Figure 1c. Sequence data that cover the entire region of the Red locus were obtained using the MiSeq data. This is now explained in the figure caption and Methods. Were the two runs reasonably consistent in recovery of loci? What was average depth per locus per individual? Why is nothing more done with these data other than the plot in Fig 1a? E.g., PCA or STRUCTURE analysis of the RAD-seq data would be substantially more powerful for the analysis of possible population structure than 9 microsatellites.
We meant minimum depth for each site per individual (> 2) and maximum missing genotypes over all individuals (< 30%). HiSeq2000 runs had more coverage than GAIIx runs. However, by applying a filter based on missing genotypes, we could effectively remove sites that were found only in the HiSeq2000 run. The mean depth per locus per individual differed between libraries (GAIIx: L1 = 3.2, L2 = 7.8; HiSeq2000: L3 = 13.2). However, because we mixed equal number of black and red birds in L1 and L2, any bias due to the coverage in the inference of structure between morphs will be limited. We have rephrased this section now. Originally, we did not design the RADSeq for population genetic analysis. We wanted to find the causative region on the Z chromosome at the lowest cost. We agree that the RADSeq data can be used for further analysis so we have now added the suggested fineRADstructure analysis.

378
This resulted in a mean of 18.7 million reads per individual (range 11.4-26.9 million reads) We thank the reviewer for pointing this out. We were not careful in distinguishing paired and collapsed reads.
We have now been explicit and reported the total number of reads to avoid confusion: Average 33.7 million total reads, which is approximately 3.6x coverage.
58 389 used a minor allele frequency (MAF) filter of 5% Comment [A58]: This is not a good approach for possible analyses that rely on the allele frequency distribution or rare alleles specificallye.g., recent co-ancestry can be assessed using rare alleles (e.g., using fineRADstructure for RAD data) We acknowledge that this cut off is not ideal, but when working with the low coverage data that we generated it was prudent to be cautious. Our WGS data were from a captive population, so are not suitable for the population genetic analyses to detect population structure etc. This dataset was used only in detecting outlier loci via F ST to identify and confirm the location of the Red locus.
We have rephrased this part as "To properly correct for the ploidy of genotypes identified on the Z chromosome, we used the vcf-fix-ploidy function of VCFtools to assign one allele at each site for Z-linked loci in females. To increase the chance of detecting the location of the Red locus in an F ST outlier analysis ( Supplementary Fig. 1), we only included data from 17 individuals whose genotypes were known to be homozygous or hemizygous for alternative alleles at the Red locus based on the pedigree (black: n black male = 4, n black female = 3; red: n red female = 5 and n yellow female = 5)".
60 396 mean FST Comment [A60]: For the Z-chromosome, was this calculated with appropriate consideration of ploidyi.e., total of 11 black chromosomes and 10 red chromosomes?
We used the vcf-fix-ploidy function of VCFtools to assign one allele at each site for the genotypes of hemizygous Z-linked loci in females. We obtained the allele frequencies based on the number of chromosomes we sampled (e.g. total of 11 and 10 alleles for black and red birds respectively for Z-chromosome) and used these in the calculation of F ST (see Methods section 'Estimates of DNA polymorphism and divergence' for calculation of F ST ). We now present site-by-site F ST scans for wild and captive populations in Supplementary  Fig. 1 We meant that we used both RADSeq and WGS, but did not combine these data. The RADSeq result is presented in Fig. 1 and WGS data are shown in Supplementary  Figure 1. In "identification of the Red locus" section, we deleted "a combination of ", for clarity. In the GWAS result ( Fig. 1)  These are twenty-six short fragments (~1kbp each) across the Red locus. The distribution of all SNPs found in these fragments is shown in the middle panel of Figure 1c. Sequence data that cover the entire region of the Red locus were obtained using the MiSeq. Throughout the manuscript the "Red locus" is the full ~100 kbp region sequenced using the MiSeq that contains the ~72 kbp "core region" with high divergence. Because we didn't find any breakpoints nor causal sites that distinguish the core and the flanking regions, the definition of the Red locus is to a degree arbitrary. Because we did not find significant differences in the recombination between entire regions and the core, we use parameters estimated using the entire region in the simulation. We believe this is a more conservative approach than using only the highly divergent region. See main responses.

493
One additional sequence stretch (henceforth the 'flanking' region), approximately 7 kb long, was obtained for the same samples adjacent to the Red locus (Supplementary Table 6).

Comment [A75]
: What is the value of this "disconnected" sequence on one side of the locus, and again, what is the distinction between the locus and the "core" of the locus?
We agree that including the extra 'flanking region' is confusing, and that this is not used for any analysis beyond as a comparator for population genetic summary statistics. We have therefore removed reference to it, for clarity. In our simulations, we used sequence data from females of both morphs. There is a danger that studies on polymorphism fail to correct the bias due to the nonrandom sampling that otherwise results in association studies based on a focal phenotype. Using population genetic parameters estimated from the non-random sample would violate the assumptions of many common population genetic tests. Here, we provide a way to deal with such biases. We are referring to the sampling scheme of 6 + 6 red and black birds sampled from a wild population where the allele frequencies in nature are 0.144 and 0.856. We have amended the text at this point to make this more explicit, and reordered the methods section so that the part detailing with our consideration of non-random sampling precedes this section. We have modified the text to be more explicit about what parameters we meant -theta and rho. We are interested in testing our null hypothesis of neutral evolution + non-random sampling scheme, but are not carrying out parameter estimation. To this end we determined a value of theta from the reference loci and a value of rho from the RM test at the Red locus in order to parameterise our simulations. This concern is also responded to in the response to the reviewer's letter.

568
for the sequence of the entire core region of the Red locus Comment [A85]: Why "the entire core region" here when the data were based on the M1-M3 loci? How long are the fragments "amplifiable" using these primer pairs?
We wanted to have the best model for the entire core of the Red locus and assumed the same model can be applied to other parts of the region. This model was used to construct trees in Figure 4, Supplementary Figs. 13~15 which use parts or entire core region.

-Main.pdf".
There continues to be uncertainty and controversy regarding the relative importance of multilocus supergenes (typically associated with genomic regions of low recombination such as inversions) versus pleiotropic single locus "switches" in the evolution of color polymorphisms associated with correlated phenotypic traits. Lately, some supergene examples have been characterized and received a good deal of attention, notably in birds. In this MS Kim and colleagues present evidence for a roughly 72kbp intergenic sequence, that likely plays a role in the regulation of the follistatin gene, as being the only consistent difference genomic between the red and black morphs of Australia's Gouldian finch, a species in which the morphs differ in morphology, behavior, and physiology as well as head color. This species and its polymorphism are famous and visually striking, but a bit enigmatic in that some of the results for more or less domesticated strains and laboratory populations seem not to hold up well in studies of wild birds. The authors additionally find no evidence of an inversion being associated with the region of high differentiation and the evidence against an inversion seems to me persuasive.
I can see no obvious or severe problems with the claims presented by the authors, and the MS is generally well written with few errors.
We thank the reviewer for their kind and supportive comments. Below we address specific points.
One point that seems to me unclear is why balancing selection and an inversion are treated as strict alternatives. If a population is polymorphic for the presence or absence of an inversion, one must still explain the persistence of that polymorphism, and balancing selection could be the answer. Similarly, inversions may not be a perfect barrier to gene flow between homologous inverted and non-inverted regions of chromosomes. What they are really seeking to explain is strong linkage disequilibrium along a substantial length of chromosome, with the alternative forms associated with different complex phenotypes.
We agree with the suggestion that an inversion and balancing selection should not be presented as alternatives, and have amended the abstract accordingly.
The main criticism one can make is that the story is somewhat incomplete. The authors were unable to characterize transcriptomic differences between the morphs that relate directly to the putative switch locus, thus the mode of action of the apparent switch has not been elucidated, and they have not attempted a manipulation of the locus in question, so its role is not fully established. The main question for the Editor then is whether the story as presented-which is clearly important and interesting-is of sufficiently broad interest and impact as to justify publication in Nature Communications.
We agree that it would be fascinating and important to explore the molecular function of the candidate gene we found, but also that it is beyond the scope of this study.  Figure 3a-d.). So, we have chosen to refer to the simulations presented in Figure 3 at this point.  1219-1229 (2005).
254: better to explain a little more clearly how this scenario leads to frequency-dependence It is now clarified that the high frequency of red males creates a competitive and stressful environment and that red males suffer more from those conditions.

263: run on sentence
We have split the sentence into two.

267: sentence runs on some
We have split the sentence into two. The order of the legend has been corrected.
It is now Supplementary Fig. 11a (previously Supplementary 10a) that shows a uniform distribution.
Suppl 506: could mention lack of field evidence for assortative mating here We have now clarified that these incompatibilities were not fully tested in a wild population in the introduction.

Reviewer #3 (Remarks to the Author):
This is an impressive and well carried out study examining the genetic basis of a color polymorphism in Gouldian finches using a variety of approaches. The authors demonstrate that the divergence between the red and the black morph reside to a 72 kbp region on the Z chromosome, consistent with earlier findings of this polymorphism being located in this region on Z. There are several interesting findings: firstly, given the many differences in a plethora of traits between the morphs, that this is not due to an inversion as seen in many other species but rather (most likely) due to pleiotropic effect. Second, that it locates to a intergenic region suggesting a regulatory role (also well worth keeping in mind that 60-70% of all GWAS hits are intergenic) rather than a coding sequence change. Third, using coalescent simulations, the authors demonstrate that this polymorphism is likely maintained by balancing selection and arose in the Gouldian finch lineage.
Overall, this ms is very well written and carried out, although I note that I do not have expertise in all the analyses performed by the authors. It provides a nice contrast to the recent findings of the involvement of inversions ('supergenes') causing polymorphisms in several bird species (ruffs, zebra finches etc).
We thank the reviewer for their generous and enthusiastic comments. The suggestions on how to improve the manuscript are helpful, and we have taken these on board.
From my reading the mapping and identification of the region seems very solid and it is very interesting that the same gene seem to be involved in color divergence in a different species pair (golden and blue winged warblers). Given the finding of a TE within the Red locus I did wonder if this could be linked to the polymorphism as well, even if it arose before divergence between the red and black morph it could facilitate divergence. This is interesting since TE is linked to color patterning and polymorphism in both Heliconius and Biston bettularia.
We agree that there is a possibility that the inserted TE might have had a role in the polymorphism given the highest divergence between morphs in this region. However, refining the It is also not clear how This is even more so given that the analysis to demonstrate balancing selection as the mechanism for the maintenance of this polymorphism require strong assortative mating, something which is not observed in natural populations. answer it with the data presented here, and doing it justice would likely involve a large amount of extra effort, which we feel is outside the scope of the current study. characteristics (and history) are compared to a simulation that assumes uniform processes across the entire region; and 2) the simulation analysis and its results becomes more complicated to present and more difficult for the reader to understand and interpret. It certainly raised a number of questions for me.
While I continue to think that defining the "red locus" as the region of clearly elevated Fst and LD is the only sensible approach, I understand that adopting my suggestion would require repeating a number of analyses. On the other hand, by arbitrarily defining the red locus as encompassing a larger region than it really does, the authors add unnecessary confusion and complication to what is a striking and obvious result. I note that in the abstract the authors refer to the smaller ~72kb region of the Z chromosome as being associated with the color polymorphism. Likewise, Toomey et al. (2018) identify the same ~70kb region as the "candidate locus." Either 70 or 72 kb reflects a much more sensible definition than the ~100kb adopted for the simulation analysis.
While I continue to disagree with the authors' decisions and rationale on this issue, the additional information and clarifications the authors provide are probably sufficient to make the analyses understandable for most readers.
The only other concern I would raise is that a very similar paper was recently published. Toomey et al.
(2018) present similar analyses for the same species, including RNA-seq data and a genome scan based on whole genome re-sequencing data, and obtain broadly similar results. The Toomey et al. paper is based on samples obtained from captive birds in Portugal rather than a wild population, but in most other respects, it is in my opinion a better paper, with more clearly presented analyses that are based on more robust data sets. (Please note that I have no connection to the Toomey et al. study, do not know any of its authors, and just recently learned about it.) Given that the Toomey et al. paper has been published, the authors of this paper should of course cite it, but also explicitly address the question of how the analyses they present extend our understanding beyond the results of Toomey et al. For example, the simulation analysis is one component that is unique as compared to Toomey et al. Reviewer #3: Remarks to the Author: This revised version has improved the clarity of the text further and I think this will make a very interesting contribution to the field of genetics of polymorphisms in general, and of course color polymorphisms in particular.