Fitness consequences of artificial selection on relative male genital size

Male genitalia often show remarkable differences among related species in size, shape and complexity. Across poeciliid fishes, the elongated fin (gonopodium) that males use to inseminate females ranges from 18 to 53% of body length. Relative genital size therefore varies greatly among species. In contrast, there is often tight within-species allometric scaling, which suggests strong selection against genital–body size combinations that deviate from a species' natural line of allometry. We tested this constraint by artificially selecting on the allometric intercept, creating lines of males with relatively longer or shorter gonopodia than occur naturally for a given body size in mosquitofish, Gambusia holbrooki. We show that relative genital length is heritable and diverged 7.6–8.9% between our up-selected and down-selected lines, with correlated changes in body shape. However, deviation from the natural line of allometry does not affect male success in assays of attractiveness, swimming performance and, crucially, reproductive success (paternity).

Male genitalia are remarkable for their extreme divergence among species in size, shape and complexity [1][2][3]. Despite high variation among species in mean relative genital size, within-species body-size scaling (static allometry) is very tight in some species; most of the variation in genital size is explained by variation in body size (that is, there is a high R² value for the regression of genital on body size). If there is genetic variation for relative trait size, because males vary in their ontogenetic allometric slopes and/or intercepts, then without strong selection on deviations from the natural allometric relationship genetic drift should reduce R². Tight allometry implies that males with novel trait-body size combinations (relatively large or small genitals for their body size) have lower fitness.
In live-bearing poeciliid fishes, males use their modified anal fin (gonopodium) to inseminate females. Gonopodium length varies across species from 18 to 53% of mean body size 4 . However, as in other taxa, there is often low intraspecific variation in genital length for a given body size 5 . For example, in the mosquitofish G. holbrooki, body length explains over 90% of variation in gonopodium length. Even so, recent selection analyses of poeciliids find that relative genital size and shape is associated with male mating and/or reproductive success in both G. holbrooki 6,7 and a related species, the guppy, Poecilia reticulata 8,9 . Surprisingly, the detected selection is directional rather than stabilizing and favours males with a relatively large gonopodium for their body size.
A weakness of selection analyses is that they are correlational. A relationship between a focal trait and reproductive success can arise if both are affected by another variable 10 . In red deer, Cervus elaphus, for example, favourable environmental conditions lead to larger male antlers, but also elevate female breeding success, generating a spuriously high estimate of the selection gradient on male antler size 10 . How then do we determine whether relative genital size in poeciliids is under selection, which might account for its precise relationship with body size? One approach, especially in sexual selection studies, is to experimentally manipulate a focal trait 11,12 . To date, the only experimental evidence for direct selection on male genitals comes from gross manipulation of genital features by ablating or surgically removing major components. These studies demonstrate that certain genital traits affect fertilization success [13][14][15][16][17][18][19] . The problem with such experiments is that developmentally integrated traits that might affect the consequences of manipulating a single trait are left unchanged 2,20,21 . A reduction in male fitness might therefore reflect a lack of compensatory developmental changes in other traits rather than sexual selection acting directly on the manipulated trait 2,20 .
Crucially, we lack experimental studies in which novel genital-body size combinations are created such that developmentally integrated correlated traits can still co-evolve. One way to achieve this is to use artificial selection and compare the fitness of control and selected lineages. If a focal trait is heritable, artificial selection might even shift the mean value outside the natural range but, importantly, genetically or developmentally correlated traits will change in concert. Any resultant effects on fitness therefore cannot be attributed to a mismatch in the expression of co-evolved traits. Artificial selection should reduce fitness if the focal trait is under strong natural and/or sexual selection 21 .
Artificial selection on allometric intercepts or slopes to alter trait-body size relationships has been applied to naturally selected traits [21][22][23], but surprisingly few studies have done so for putative sexually selected traits. The available studies have targeted traits such as ornaments, testes and weaponry that are assumed to affect male mating rate, sperm competitiveness and fighting success, respectively [24][25][26][27][28][29]. Alternatively, researchers have selected on net male attractiveness 30,31 . Only one study has selected for an aspect of male genitalia, namely absolute, but not body size-corrected, genital spine size in a beetle 13 . Surprisingly, given the ubiquity of high interspecific variation in genital size in many taxa, there are no studies using artificial selection to create males with novel combinations of genital and body size (outside the natural range of variation for the species in question, although these combinations might occur in closely related species). Creating these novel phenotypes is most readily achieved by selecting on the allometric intercept 32,33 . In principle, it could also be achieved by only selecting on the allometric slope such that mean relative trait size stays unchanged (so relative trait size increases for some males and decreases for others, depending on whether they are of larger or smaller than average body size). Only three studies have selected on allometric slopes in this way [34][35][36]. Theory suggests that it is more difficult to change allometric slopes than intercepts 22,23 .
To test whether there is strong sexual and/or natural selection on relative genital size, we artificially selected on the intercept of the allometric regression to either increase or decrease mean relative gonopodium length in G. holbrooki (three replicates of up-selected, down-selected and control lines). The tight relationship between gonopodium length and body size raises questions about the role of current selection against deviations from the natural line of allometry versus past selection for developmental trajectories that generates strong covariance between traits. After we applied artificial selection for eight generations, gonopodium length had diverged by 7.6-8.9% between our up-selected and down-selected lines. Mean relative gonopodium length changed in the direction of selection, while the allometric slope remained unchanged. We then tested whether deviations in either direction away from the natural line of allometry affect male fitness. Previous studies have reported directional selection on relative gonopodium length, but we did not find that males from up-selected lines are more attractive to females 11,12 nor that they have weaker swimming performance 12 than down-selected males. In combination, we did not find that the net effect of selection is greater male reproductive success (fitness) for control line males than males in either the up-selected or down-selected lines when they freely compete for mates and fertilization opportunities in semi-natural pools. In short, we did not find that novel genital-body size combinations are selected against.

Results
Evolution of relative genital size. In wild-caught male G. holbrooki, body size accounted for 91.2% of variation in gonopodium length (N = 545). In conjunction with weakly negative allometry (slope of log gonopodium length on log body length regression < 1: 0.918 ± 0.012), there is therefore little variation in relative genital size (30.5 ± 0.04% of body length; all summary statistics are mean ± s.e.). Despite this precise allometry, we observed a clear response to artificial selection on mean relative gonopodium length (Fig. 1 and Supplementary Fig. 1), indicating no short-term constraints on genital size evolution 37 (but see refs 22,23,31). To test the response of mean relative gonopodium length, male body size and the allometric slope of the gonopodium-body size regression (see Methods for details of response variables) to selection on relative gonopodium length, we ran separate linear models (LMs) for up-selected and down-selected lines for each trait, treating replicate, generation and their interaction as factors. Selection on mean relative genital size resulted in bidirectional evolution (LM: generation: Up: F1,21 = 119.08, P < 0.001; Down: F1,21 = 67.24, P < 0.001) that did not differ between replicates (LM: Up: F2,21 = 2.76, P = 0.09; Down: F2,21 = 1.037, P = 0.37). Individual regressions of relative gonopodium length on generation were highly significant for all six selection lines (LM: all P < 0.005, N = 9 generations, R² = 69.8-89.9%). After eight rounds of selection, the gonopodium of an average-sized male when selected upward was 4.97%, 4.26% and 6.78% larger (replicates A, B and C, respectively), and when selected downward was 3.93%, 3.30% and 2.13% smaller, than that of a control line male (Fig. 2 and Supplementary Fig. 2). This difference persisted after a generation of relaxed selection.
The mean relative gonopodium length of up-selected males was 6.66, 3.51 and 4.95% greater, and that of down-selected males was 3.84, 2.95 and 2.73% smaller, than that of control males. The realized heritability of mean relative gonopodium length was 0.028±0.006 in the up-selected lines and 0.022 ± 0.005 in the down-selected lines (Supplementary Table 1). Higher realized heritability estimates are obtained using a less conservative approach (see Supplementary Methods).
Male attractiveness. Male genitalia can directly influence female mate choice in species with external intromittent organs, including poeciliids 11,12 (but see ref. 38) and humans 39 . Artificial selection did not, however, affect male attractiveness, when measured as association times of wild-caught females (N = 151) that were presented with three size-matched males (control, down- and up-selected; Supplementary Fig. 4). Females spent 39.9 ± 1.2% of each 20 min choice trial associating with compartments housing males and a linear mixed-effects model (LMM) showed that females preferred males over a fourth empty compartment (Wald's χ² = 73.89, df = 3, P < 0.0001; Table 1 and Supplementary Table 2). Females spent on average 157.2 ± 12.8 s with the control, 164.5 ± 12.2 s with the down-selected and 157.1 ± 13.5 s with the up-selected male (respectively, 32.6 ± 2.2, 34.7 ± 2.2 and 32.1 ± 2.2% of total association time). Females did not prefer males from up-selected lines that have a relatively long gonopodium for any given body size.
Male reproductive success. To test for sexual selection on relative gonopodium length, we stocked ten large ponds per replicate (700 l; 30 ponds in total) with eight wild-caught virgin females and six males from generation 8. These comprised two trios of size-matched 'small' and 'large' males ('small' were ~80% the size of 'large'); each trio consisted of a control, up-selected and down-selected male. We genotyped all males, the 165 females that gave birth and their 2,284 offspring to assign paternity based on ~4,400 single-nucleotide polymorphisms (see Supplementary Methods). On average, males sired 12.79 ± 1.37 offspring (N = 173 males). For the 104 males that gained some paternity, the average number of offspring was 26.04 ± 3.11. Neither artificial selection on mean relative gonopodium length nor male size explained how many offspring a male sired (generalized LMM: selection: Wald's χ² = 0.79, df = 2, P = 0.68; size: Wald's Fig. 3 and Tables 3 and 4). Although mean paternity success appeared to differ across replicates (Fig. 3), replicate was not a significant predictor in the model (generalized LMM: Wald's χ² = 3.27, df = 2, P = 0.20; Table 4). If reproductive success was treated as a binary outcome (a male either sired offspring or did not), this post-hoc test showed that large males were significantly more likely to sire offspring than were small males (binomial LMM: Wald's χ² = 5.22, df = 1, P = 0.02). Again, however, there was no effect of artificial selection on gonopodium length (Table 4). Thus, when males could freely compete for mates and sperm competition occurred, artificial selection revealed no fitness cost associated with the evolution of relatively larger or smaller gonopodia than occur naturally for males of a given body size.
Body shape. There is strong selection on body shape in fish because of its hydrodynamic effects. For example, population comparisons in Gambusia species show convergent evolution of body shape in response to predation risk 43 . We used standard body shape landmarks for fish ( Supplementary Fig. 6) in a recently developed geometric morphometric analysis 44 Table 4). Up-selected males had a deeper body and more posterior gonopodial insertion than down-selected or control line males, and females from up-selected lines had a deeper abdomen and shorter tail than down-selected or control line females ( Supplementary Fig. 8). The swimming performance trials indicate, however, that these body shape differences did not affect burst-swimming speed.
Gonopodium tip shape. Population comparisons in Gambusia and other poeciliids reveal relationships between gonopodium tip shape and predation risk 41 , and selection analyses have linked tip shape to male fertilization success 8 . Geometric morphometric analyses showed that tip shape was related to the size of the distal part of the gonopodium (Procrustes MANOVA: F1,411 = 25.44, P < 0.001; Supplementary Fig. 9 and Supplementary Table 3), which is also highly correlated with total gonopodium length (r = 0.835, P < 0.001, N = 411). Tip shape therefore differed among selection treatments, simply because males differed in gonopodium length (Figs 1 and 2). The allometric relationship between tip size and shape did not differ among selection treatments (Procrustes MANOVA: F2,399 = 1.362, P = 0.245; Supplementary Table 3). Correcting for size, there were no differences in tip shape in response to artificial selection on gonopodium length (F2,399 = 0.775, P = 0.647; Supplementary Table 3).
Fecundity. Natural deviations from the line of allometry might reflect individual variation in quality. For example, many sexual traits are condition dependent, including some genital traits 45 . Greater body condition is likely to have beneficial effects; thus, artificial selection on gonopodium length might have indirectly selected for males in good condition. If condition itself is heritable, we may have indirectly improved mean body condition (of both sexes, if condition has a positive inter-sex genetic correlation). This increase in body condition could elevate female fecundity and/or male fertility. We therefore measured the within-line success of 120 mating pairs per selection treatment. Larger females produced more offspring (generalized LMM: Wald's χ² = 16.14, df = 1, P < 0.0001; Table 5 and Supplementary Table 5); however, controlling for replicate (generalized LMM: Wald's χ² = 5.29, df = 2, P = 0.07), there was no difference in fecundity between up-selected, control and down-selected pairs (generalized LMM: Wald's χ² = 1.14, df = 2, P = 0.57; Table 3).
The proportion of pairs that produced offspring varied among selection treatments (control: 83.3%; up-selected: 78.3%; down-selected: 65.8%; each N = 120); however, after accounting for replicate effects, these differences were not significant (generalized LMM: Table 5).

Discussion
Natural and sexual selection have been proposed to explain key aspects of allometric scaling of male genitals in many taxa, mainly related to allometric slopes 22,23,46 . Here we focus on the role of selection in explaining another aspect of genital allometry. In wild populations of G. holbrooki, there is a very precise relationship between male genital size and body size (R² > 90%). Consequently, there is low variation in relative genital size (that is, genital size corrected for body size). Despite the logistic challenge posed to ensure measurement error does not obscure estimates of a male's relative gonopodium length, we successfully used artificial selection in both directions to produce males with gonopodia that were beyond the naturally observed range of lengths for their body size (Fig. 2 and Supplementary Fig. 2). Therefore, there are no immediate developmental or genetic constraints preventing relative genital size from evolving. Although our estimates of realized heritability are low, we suggest they be treated cautiously. Given stochastic variation in body size and allometric slopes across generations (Supplementary Fig. 2; see also refs 22,46), there is probably similar variation across lines within generations that introduces noise into estimates of selection differentials and responses to selection, and hence heritability (h² = R/Sc; see Methods). In contrast, the steady generational change in mean genital size in selected lines (relative to control lines) for an average-sized male is readily apparent (Fig. 1).
This steady change is also interpretable in terms of realized heritability, as selection intensity was similar each generation, because we always bred from males in the top/bottom 40 of 129 ± 3 measured males. We applied artificial selection on residuals perpendicular to the allometric line so that genital and/or body size could evolve. As in other studies taking this approach 32,33 , the mean value of the focal trait evolved (that is, gonopodium length) while mean body size did not. This difference in response suggests stronger stabilizing selection and/or lower heritability of absolute body size than absolute genital size. As in most studies, there was also an asymmetric response to selection 47 . The response was stronger in up-selected than in down-selected lines ( Fig. 1; see also refs 35,36).
We expect a tight trait-body size relationship with low variation around the line of allometry if there is strong selection for specific trait-body size combinations. For gonopodia, there is evidence from previous studies that selection on relative size could arise from the combined effects of natural and sexual selection. There are several lines of correlational evidence for sexual (or natural) selection on male genitalia. First, comparative analyses show that interspecific genital shape diversity is lower in insect clades where females are monogamous rather than polyandrous, implicating sexual selection 48 . Similarly, the intensity of post copulatory sexual selection sometimes predicts genital evolution 49 . Second, selection analyses show that natural variation in genital size and/or shape predicts paternity in some species [50][51][52][53][54] , including G. holbrooki 6,7 , implying that these traits are sexually selected. Selection analyses in poeciliids also suggest that there is natural selection on genital size, because it affects locomotion, which should affect survival under predation 5,20 . Third, experimental evolution studies where sexual selection is present or absent 55 report that male genitalia evolve as predicted by selection analyses 56,57 .
We currently lack experimental studies in which novel genital-body size combinations are created in such a way that developmentally integrated correlated traits can co-evolve. Here we achieved this goal using artificial selection and then tested for selection against deviation from natural allometry. We investigated components of selection acting both directly on genital size and on correlated traits (for example, effects on females arising from inter-sexual genetic correlations). Our most important finding was clear. In a competitive situation where swimming performance, rates of copulation attempts, insemination success and fertilization ability are all likely to affect male reproductive success, there was no detectable effect of deviation from the line of natural allometry (Fig. 3). This key finding is consistent with our detailed investigation of specific potential sources of variation in male fitness. First, despite previous work reporting that a relatively longer gonopodium slows a male's escape response in Gambusia species 12 , there was no detectable decline in burst-swimming speed in up-selected males. Second, there was no detectable sexually selected cost of a shorter gonopodium due to reduced male attractiveness. It should be noted that previous experimental evidence for this relationship in G. holbrooki is based on cutting 15-17% off the gonopodium 11 , whereas the difference between up-selected and down-selected lines was < 9%. Third, there was no change in size-corrected genital tip shape, a trait that predicts fertilization success in other poeciliids 8 .
It is important to note that our main finding of no effect of relative genital size on male reproductive success is unlikely to be due to low statistical power. We assigned 2,284 offspring to 173 potential sires. Three recent selection analyses of G. holbrooki and P. reticulata, which reported that relative gonopodium length explains significant variation in male reproductive success, were all smaller than our study (Head et al. 6 assigned 844 offspring to 240 potential sires, Vega-Trejo et al. 7 assigned 629 offspring to 122 potential sires and Devigli et al. 9 assigned 532 offspring to 60 males). These studies highlight the discrepancy between our results and past selection analyses. One plausible explanation is that variation in relative gonopodium length in past selection analyses is due to environmental factors that affect fertilization ability. For example, the diet of male G. holbrooki affects relative gonopodium size 58 and sperm production 59 . A similar confounding variable affecting body condition could explain the correlation between swimming performance and relative gonopodium size 12 . It would be useful to test whether the recently reported strong effect of relative gonopodium length on paternity in guppies 9 is still detected after artificial selection.
Tight allometry of a focal trait need not be due to direct selection on the trait. It might arise due to selection on genetically and developmentally correlated traits (for a possible example, see ref. 35). For example, genital appendages and horn size are genetically correlated in beetles 60 and optimal horn size depends on body size for biomechanical reasons. Tight genital-body size relationships could similarly arise due to size-dependent allocation of resources to developmentally linked traits under size-dependent selection. For example, ablation of imaginal discs precursory to genitalia increased horn size in dung beetles 61 ; thus, environmental variation in resources could yield a genital allometry that is driven by strong size-dependent selection on horns. Selection on other traits should, however, still lead to lower fitness of males selected away from the line of allometry, which we did not observe. Another possibility is that inter-sexual genetic correlations constrain trait allometry. Indeed, in G. holbrooki, female body shape did show a correlated response to selection on male genitals but, again, there was no detectable decline in female swimming performance or pair fecundity in selected lines. Finally, we must acknowledge that, as in most studies, we cannot measure net fitness. Instead, we can only measure some components of fitness. It is therefore possible that we failed to detect selection on deviation from the line of natural allometry because we did not measure the appropriate fitness component. Crucially, however, we did measure paternity.
In sum, high variation among species in mean relative genital size is a common pattern in many animal taxa; hence, it is surprising that no previous studies have used artificial selection to alter genital size and test for the effects on fitness. This study design is a powerful way to detect fitness effects of deviation from natural trait-body size relationships (for example, artificial selection on relative butterfly wing size produced males with novel large- or small-winged phenotypes with lower mating success than control males 32,33 ). Artificial selection increases the available phenotypic variation, which should make it easier to detect fitness costs than when conducting selection analyses on standing variation 62 . This is especially relevant for traits with high R² and hence low variation in relative size 42 . The lack of evidence for selection against deviations from the natural line of allometry in our study is therefore a genuine conundrum.
Unfortunately, difficulties in reporting unexpected findings lead to well-known publication bias that systematically distorts science 63 . As such, it is difficult to assess whether our results are genuinely anomalous or reflect a larger file drawer problem in evolutionary biology.

Methods
Initiation of experimental lines and selection protocol. We collected eastern mosquitofish (G. holbrooki) from Western Sydney, Australia, in March-May 2007. Fish were housed communally in 120-l tanks at 28°C on a 14:10 h light:dark cycle and fed ad libitum (twice daily) with Artemia nauplii and fish flakes. We set up 180 gravid females in individual 3-l tanks until we had 150 broods of laboratory-born offspring (there is multiple paternity so the number of sires is > 180). To obtain virgin females we continually removed males, who can be identified as soon as their anal fin begins to elongate into a gonopodium. We used 540 virgin females to create 9 experimental lines (60 females per line), with all females within a line originating from a different brood. Given multiple mating in G. holbrooki, broods almost always consist of maternal half-siblings. The adult males that we used to initiate lines were field collected in December 2007.
We set up three replicates (A, B and C) between December 2007 and May 2008. Each replicate comprised two selection lines ('Up' and 'Down') and an unselected control line. To select founding sires (generation 1) for each replicate, we measured the body size (standard length, SL) and gonopodium length of 121 (A), 140 (B) and 171 (C) wild-caught males. Each male was briefly immobilized in iced water (< 4°C), then photographed with a Nikon Coolpix 5700 digital camera attached to a Leica Wild MZ8 dissecting microscope. Male SL and gonopodium length (from the tip to the juncture between the two last clear segments of the gonopodium before it attaches to the body wall; Supplementary Fig. 6) were measured using ImageJ software (http://imagej.nih.gov/ij/). The allometric relationship for each line was calculated as the reduced major axis (RMA) regression of log gonopodium length on log SL; gonopodium allometry did not differ for the three sets of wild-caught males that initiated the replicates (slopes: F2,539 = 0.469, P = 0.626; intercepts: F2,539 = 0.247, P = 0.781). We selected males based on their deviation from the regression line (using positive residuals for up-selected and negative residuals for down-selected males). Selection based on these residuals, which are perpendicular to the regression line, should shift the intercept (that is, mean relative gonopodium size), but not the slope, of the allometry 32,33 . This protocol could lead to the evolution of mean relative gonopodium size due to selection on body size and gonopodium size. In contrast, the use of ordinary least squares regression does not generate direct selection on body size. In practice, the use of residuals from ordinary least squares regressions identified almost exactly the same males for selection in every line in every generation.
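The residual-based selection step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names (`rma_fit`, `perpendicular_residuals`, `select_males`) and the toy data are hypothetical, and the inputs are assumed to be log-transformed SL and gonopodium lengths for the males measured in one line.

```python
import math

def rma_fit(log_sl, log_gl):
    """Reduced major axis fit of log gonopodium length on log SL.

    The RMA slope is sign(cov) * sd(y) / sd(x); the line passes through the means.
    """
    n = len(log_sl)
    mx = sum(log_sl) / n
    my = sum(log_gl) / n
    sxx = sum((x - mx) ** 2 for x in log_sl)
    syy = sum((y - my) ** 2 for y in log_gl)
    sxy = sum((x - mx) * (y - my) for x, y in zip(log_sl, log_gl))
    slope = math.copysign(math.sqrt(syy / sxx), sxy)
    return slope, my - slope * mx

def perpendicular_residuals(log_sl, log_gl):
    """Signed perpendicular distance of each male from the RMA line."""
    slope, intercept = rma_fit(log_sl, log_gl)
    norm = math.sqrt(1.0 + slope ** 2)
    return [(y - (intercept + slope * x)) / norm for x, y in zip(log_sl, log_gl)]

def select_males(log_sl, log_gl, k, direction):
    """Indices of the k males with the most positive ('up') or negative ('down') residuals."""
    res = perpendicular_residuals(log_sl, log_gl)
    order = sorted(range(len(res)), key=lambda i: res[i], reverse=(direction == "up"))
    return order[:k]

# Hypothetical toy data (not from the study): four males
toy_sl = [0.0, 1.0, 2.0, 3.0]
toy_gl = [0.0, 1.2, 1.8, 3.0]
up_pick = select_males(toy_sl, toy_gl, 1, "up")
down_pick = select_males(toy_sl, toy_gl, 1, "down")
```

For a single fitted line, perpendicular and vertical residuals differ only by a constant factor, so they rank males identically; the choice of line (RMA versus ordinary least squares) is what can change which males are picked.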
We selected males with the largest (Up) or smallest (Down) relative gonopodium length: 30 males per selection line for replicates A and B, and 40 males per selection line for replicate C. As not all pairs bred, the number of selected males was increased to 40 in all replicates in all subsequent generations to increase the likelihood that at least 30 males successfully sired offspring. For the control lines, 30 (or 40) males were chosen at random from another group of wild-caught males (that is, we did not exclude males that might otherwise have been assigned to an Up or Down line). The least squares regression of log gonopodium length (cm) on log SL (cm) for the 545 wild-caught males was Y = (0.918 ± 0.012)X − (0.491 ± 0.004) (R² = 91.2%). The slope was significantly less than unity (t = 6.83, P < 0.001), showing negative allometry. The same was true using RMA regression, where the mean slope was 0.968 ± 0.014 (R² = 90.8%).
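As a worked check on the regression statistics above, the sketch below computes ordinary least squares and RMA fits from raw data and the t statistic for a slope against unity. The helper names and the toy data are hypothetical; only the slope (0.918) and its s.e. (0.012) are taken from the text. For a positive correlation r, the RMA slope equals the least squares slope divided by r, which is why the RMA slope sits closer to 1.

```python
import math

def ols_and_rma(x, y):
    """Return (ols_slope, ols_intercept, rma_slope, rma_intercept, r) for y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    ols_slope = sxy / sxx
    rma_slope = math.copysign(math.sqrt(syy / sxx), sxy)
    return ols_slope, my - ols_slope * mx, rma_slope, my - rma_slope * mx, r

def t_against_unity(slope, se):
    """t statistic for the null hypothesis that the allometric slope equals 1."""
    return (1.0 - slope) / se

# Values reported in the text for the 545 wild-caught males
t_reported = t_against_unity(0.918, 0.012)

# Hypothetical toy data, used only to illustrate the OLS/RMA relationship
ols_b, ols_a, rma_b, rma_a, r = ols_and_rma([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

Rounded to two decimals, `t_reported` reproduces the t = 6.83 given above.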
Each male was paired consecutively with two virgin females to increase the likelihood that all males sired offspring. Each pair was placed in a 3-l tank for 1 week. The male was then removed, while the female was allowed to produce one to two clutches. Fry were removed from their mothers' tanks on the day of birth; we kept five to ten fry per mother (to obtain ~10 fry per male) to establish generation 2. Fry were reared in 3-l tanks for ~1 week (to minimize the risk of early mortality) and then pooled and reared at densities of one to two fish per litre in 120-l tanks. Siblings were split across tanks to minimize any decline in genetic diversity if tanks were lost due to accident. Again, fry were separated by sex at the first signs of maturity, to ensure females remained virgins.
For generations 2-8, once males were mature they were isolated in 1-litre tanks. We measured 129.2 ± 3.1 Up line, 128.5 ± 3.1 Down line and 96.0 ± 3.3 control line males for each generation and replicate (N = 21 selection events; 7 generations of 3 replicates). For each line, the 40 males with the most positive (Up lines) or most negative (Down lines) residuals, or 40 randomly chosen males from the control line, were selected to sire the next generation (that is, the top or bottom 31% in selected lines, given a mean of 129 measured males). Each selected male was paired consecutively with two randomly chosen females from his line. After reproducing, males were killed and preserved in Dietrich's solution.
The median number of sires per generation was 30 (range: 27-31). Breeding success varied across generations but we kept the number of males contributing to the next generation as similar as possible across lines within each replicate, while ensuring we had ~300 fry per line. The mean number of actual sires in generations 2-8 when artificial selection was applied was 31.4 ± 0.6 (N = 63 breeding events; Up: 31.6 ± 0.9, Down: 31.0 ± 1.2, Control: 31.7 ± 1.0, each N = 21). In generation 9 we did not select males based on relative gonopodium length but randomly used 60 males per line as sires (one female per male), recording their SL and gonopodium length to obtain the population means for this generation. The data for male SL and gonopodium length for generations 1-9 are provided in Supplementary Data 1. Finally, we recorded the SL and gonopodium length of 69.8 ± 5.5 males per line in generation 10, to test whether the observed differences persisted in the absence of selection in the preceding generation (Supplementary Data 2).
Response to selection. We ran separate general LMs for up-selected and down-selected lines treating replicate, generation and their interaction as factors. There were stochastic environmental effects that affected absolute body size each generation (Supplementary Figs 1 and 3). We therefore decided not to use absolute values of response traits. Instead, following common practice 32,33 , we used the deviation of each selection line mean from the control line mean for the relevant generation and replicate. The three response traits were mean gonopodium length, mean body length and the allometric slope (RMA regression) (Supplementary Figs 1 and 3). For gonopodium length, we calculated the value for a male of average SL (22.18 mm) from the RMA regression for the relevant generation, replicate and selection line.
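The standardization to an average-sized male can be illustrated with a short sketch. Several things here are assumptions rather than details from the text: the logarithms are taken to be base 10, the control coefficients are borrowed from the least squares fit reported for wild-caught males (the study used per-line RMA fits, whose coefficients are not given here), and the up-selected intercept shift of 0.02 log units is purely hypothetical.

```python
import math

def predicted_length_cm(slope, intercept, sl_cm):
    """Back-transform a log10-log10 allometric line to a gonopodium length in cm."""
    return 10 ** (intercept + slope * math.log10(sl_cm))

def percent_deviation(selected_cm, control_cm):
    """Deviation of a selection-line value from the control value, in percent."""
    return 100.0 * (selected_cm - control_cm) / control_cm

STANDARD_SL_CM = 2.218  # average SL of 22.18 mm, as stated in the text

# Control line: coefficients of the wild-caught least squares fit (assumed stand-in)
control = predicted_length_cm(0.918, -0.491, STANDARD_SL_CM)

# Hypothetical up-selected line: same slope, intercept raised by 0.02 log10 units
up_line = predicted_length_cm(0.918, -0.491 + 0.02, STANDARD_SL_CM)

deviation = percent_deviation(up_line, control)
```

Under the base-10 assumption the control prediction is roughly 0.67 cm, about 30% of SL, which is in line with the relative genital size reported in the Results. Because the slopes are equal, the percentage deviation depends only on the intercept shift.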
We calculated the realized heritability of relative gonopodium length separately for up-selected and down-selected lines, following the methods in ref. 47. The focal 'trait' was each male's residual from the control allometric slope (RMA regression) in the same replicate, in the same generation. This approach was necessary because of stochastic environmental variation across generations (Supplementary Fig. 3). We regressed the total response to selection, R (the mean of the difference between the expected value of the trait based on the control line in that generation and the observed value for all males in the focal line), on the cumulative effective selection differential, S_c (Supplementary Table 1). All six R on S_c regression lines were significantly >0 for up lines and significantly <0 for down lines (all P<0.049). Realized heritability (h²) is twice the value of the regression slope, as we only selected on males. Realized heritability estimates were small (0.016-0.038). We estimated the s.e. of the realized heritability based on the three h² estimates per selection regime (up or down), as the use of s.e. associated with each regression line (or pooling the lines) underestimates variation 47. We calculated the regression using all nine generations because of the approximately linear generational response to selection (Fig. 1) in conjunction with a selection protocol that was very similar across generations and no obvious change in the scatter of residuals around the line of allometry (R² = 90.8 ± 0.7%, N = 72). The use of all nine data points produced a conservative estimate of h². Visual inspection of R and S_c suggested some nonlinearities in the response to selection; excluding later generations improved linearity and increased h² estimates (see Supplementary Methods).
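The realized heritability calculation can be illustrated with a short sketch: an ordinary least-squares slope of R on the cumulative selection differential, doubled because selection acted on males only, followed by a mean and s.e. across the replicate estimates. The function names are hypothetical and the input values in the usage example are invented for illustration; they are not the study's data.

```python
from math import sqrt
from statistics import mean, stdev


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def realized_h2(cumulative_s, response):
    """Twice the slope of R on S_c, because only one sex was selected."""
    return 2.0 * ols_slope(cumulative_s, response)


def mean_and_se(estimates):
    """Mean and standard error across replicate-level h2 estimates."""
    return mean(estimates), stdev(estimates) / sqrt(len(estimates))


# Illustrative values only (not taken from the paper).
example_s = [0.0, 1.0, 2.0, 3.0]
example_r = [0.0, 0.01, 0.02, 0.03]
example_h2 = realized_h2(example_s, example_r)  # slope 0.01, so h2 is about 0.02
```

Taking the s.e. across the three replicate estimates, rather than from a single regression, treats the replicate line as the unit of replication.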
Terminal trait measurement assays. After seven rounds of artificial selection, in 2012 we used individuals from generation 8 to measure the effects of selection on mean relative gonopodium length on morphological, physiological and reproductive (fitness related) traits in both sexes. Assays were performed within replicate, as temporal separation of the A, B and C replicates prevented their combination. The selection lines showed clear divergence in their allometric intercept, but not slope, in all three replicates ( Supplementary Fig. 2). We deliberately restricted our analyses to a limited set of traits based on a priori justifications that we made at the start of the experiment about the probable effects of a change in relative gonopodium length on these traits. Line means for the assayed traits are presented in Table 3.
Male attractiveness. We tested whether female preferences differed for males from the three selection regimes. Within each replicate, we created size-matched trios (<0.1 mm SL) with a male from each of the Up, Down and Control lines (A: N = 51 trios, B: N = 58 trios and C: N = 42 trios). The males were individually placed in triangular corner compartments (9 × 9 × 13 cm) of a square choice arena (36 × 36 × 15 cm; Supplementary Fig. 4). The 13-cm wall facing the arena was made of clear Perspex. The fourth corner compartment was empty, to test whether females preferred to associate with males. We randomized male corner positions with respect to selection treatment. The external arena walls were lined with black plastic and the base covered with gravel. Each size-matched trio was used in a single trial. Males were then returned to their tanks. Test females were wild caught as juveniles and maintained in single-sex tanks, to ensure virginity. All females used were sexually mature and previously unmated; hence, they were likely to be receptive to mating. In each trial, a female was placed in a clear plastic cylinder in the centre of the choice arena and allowed to acclimate for 10 min. The cylinder was then raised remotely and her activity recorded for 10 min using a digital video camera positioned directly above the arena. The female was then caught, replaced in the cylinder for 2 min, re-released and recorded for another 10-min period. We used the two 10-min halves of the trials to test the temporal consistency of the female response.
Video analysis was performed blind to the position of each male type. Female preferences were inferred from their association time with each male (see ref. 11), measured as the total time spent <4 cm from his compartment ('association zone'). We tested for differences in the time females spent in each association zone (that is, with males from each selection treatment or the empty corner) with a linear mixed model in the R package lme4 (v1.1-9) 64 with male selection treatment and replicate as fixed effects and trial (female) identity as a random effect (each female provided four data points, one per male and for the empty corner). Again, we did not calculate a random effect for replicate, as it only had three levels. This analysis provides information on the relative time spent with each type of male (Table 1). Females spent only 45.7 ± 5.6 s in the equivalent 'association zone' of the empty compartment. Rerunning the model excluding the empty corner and including line identity produced almost identical parameter estimates (analysis not shown).
Females spent on average 39.9 ± 1.2% of the 20-min trial associating with males. This 'total male association time' was not related to the female's size (F1,146 = 0.102, P = 0.750) or the size of the males available to her (F1,146 = 1.010, P = 0.317; Supplementary Table 3). Females spent less time associating with males in the second half of the trial (paired t-test: t150 = 2.68, P = 0.008). There was, however, still a significant intra-class correlation for time spent with males, indicating that females varied significantly in their propensity to associate with males (r = 0.27, N = 151, P = 0.0003). In addition, individual females showed consistency in the proportion of time they spent with specific males. The intra-class correlations (ICC) for the proportion of the total association time spent with the Control male, the Down male or the Up male, respectively, between the two halves of the trial were all significant (ICC = 0.17-0.35, all N = 151, P<0.021). We present analyses for the full 20 min of the trial. The data for female association times are provided in Supplementary Data 3.
Swimming performance. We tested whether burst-swimming performance (acceleration during a startle response) differed among the three experimental lines. We used 49-53 males and 50 females from each of the 9 lines. Each fish was placed in an opaque plastic tank (24 × 29 cm), with water to a depth of 10-15 mm to limit movement on the vertical plane. A rigid plastic cylinder with a rubber base was suspended in one corner of the tank so that its base just broke the water surface (Supplementary Fig. 5). This stimulus was released when the focal fish was <10 cm from it. It hit the base of the tank, startling the fish so that it performed a 'C-start' escape response. This response has a characteristic form in which the fish first contracts its lateral musculature to form a C-bend shape 65. We recorded three consecutive C-starts per fish (N = 2,733 trials). To calculate the repeatability of burst-swimming behaviour, we re-tested ten males and ten females per line the next day 66.
Each trial was filmed from above using a digital camera with high-speed video (240 frames per second; Casio Exilim EX-FH100). We analysed the footage frame-by-frame in ImageJ using the plugin MTrackJ (http://www.imagescience.org/meijering/software/mtrackj/) to determine the distance travelled, velocity and acceleration over the first ten frames (38 ms) of the response. The starting point of each track was the position of the fish in the frame that immediately preceded the 'C'-bend. We recorded the distance from the starting point to the stimulus and the orientation of the fish relative to the stimulus in the starting frame. Preliminary inspection of the data showed that neither factor predicted swimming performance; thus, we excluded them from the final analyses. These video analyses were not performed blind, but the repeatability of our tracking measurements (that is, measurement error) was estimated by re-analysing a random subset of 20 videos.
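The kinematic quantities extracted from each track can be sketched as below, assuming a track is a list of (x, y) positions, one per frame, in calibrated units. The function name and return structure are hypothetical; this is not the MTrackJ output format. Note that ten frames at 240 frames per second span nine inter-frame intervals, that is 37.5 ms, which agrees with the reported 38 ms.

```python
from math import hypot


def burst_kinematics(points, fps=240.0):
    """Summarize a startle-response track.

    points: (x, y) position of the fish in each consecutive frame.
    Returns total distance, maximum velocity (greatest distance between
    consecutive frames divided by the frame interval), maximum acceleration
    (greatest increase in velocity over consecutive frames) and duration.
    """
    if len(points) < 3:
        raise ValueError("need at least three frames to estimate acceleration")
    dt = 1.0 / fps
    steps = [hypot(x1 - x0, y1 - y0)
             for (x0, y0), (x1, y1) in zip(points, points[1:])]
    velocities = [s / dt for s in steps]
    accelerations = [(v1 - v0) / dt for v0, v1 in zip(velocities, velocities[1:])]
    return {
        "distance": sum(steps),
        "max_velocity": max(velocities),
        "max_acceleration": max(accelerations),
        "duration_s": len(steps) * dt,
    }
```

For example, a ten-point track passed to `burst_kinematics` yields a `duration_s` of 0.0375 s at the default frame rate.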
Although  P<0.0001, N = 69). Thus, to ensure we used the most consistent and meaningful estimate of burst speed performance, for analysis we selected the fastest trial for each fish (the trial with the greatest distance travelled over ten frames) 12. As the total distance travelled, maximum velocity (greatest distance between consecutive frames) and maximum acceleration (greatest increase in velocity over consecutive frames) were strongly correlated (r = 0.38-0.94) we restricted our analyses to distance travelled. Owing to the strong sexual size dimorphism in G. holbrooki and the need to control for body size, we analysed males and females separately. We used linear mixed models in lme4 to test whether artificial selection affected burst speed performance. SL, water temperature and replicate were treated as fixed effects, with line identity (nine levels) included as a random effect (Table 2). The data for male and female swimming performance are provided in Supplementary Data 4.
Male reproductive success. We tested whether selection on gonopodium length affected male paternity success in a semi-natural competitive setting. We set up ten 700-l (1.5 m diameter, ~40 cm deep) plastic fishponds per replicate housed in a glasshouse at 28°C. Each pond contained eight adult virgin females (wild-caught as juveniles) and six males: two Up, two Down and two Control. Males were size-matched (<0.1 mm SL) in trios with one male from each line. Each pond contained a trio of large males and a trio of small males (the average size difference between large and small males in a pond was 4.7 ± 1.4 mm; males in the small trio were 80.1 ± 5.2% the size of males in the large trio). We gave each male a unique colour tag with a subcutaneous elastomer implant, injected behind the dorsal fin, while fish were immobilized in iced water (<4°C).
Fish were left in the ponds to interact freely for 7 (A) or 14 (B and C) days. Males were then photographed for morphometrics (see below) and preserved for genotyping. Females were placed in individual 3-l tanks to produce one to two broods. All fry were individually preserved for genotyping. After ~10 weeks, mothers were preserved for genotyping.
To assign paternity we genotyped single-nucleotide polymorphisms for every female that produced offspring (N = 165: A = 32, B = 64, C = 69), all potential sires (N = 179 in total; 1 male from B died during the mating period and we could not extract DNA) and every offspring produced from our mating experiments (N = 2,284 offspring: A = 369 from 42 clutches, B = 692 from 89 clutches, C = 1,223 from 102 clutches). We used the commercial genotyping services of Diversity Arrays Technology, who have developed a widely used technique called DArTseq 67 (see Supplementary Methods). We could unambiguously assign paternities for all fry in 29 of the 30 ponds. In one pond (replicate A) only six fry were produced. They did not match any of the putative sires. The most probable explanation is that the mother had mated and stored sperm before entering the pond. These data were discarded (final N = 173 potential sires available for analyses).
Owing to high over-dispersion and the fact that only 104 of 173 males gained paternity, we used the number of offspring fathered by a male as the response variable in a zero-inflated negative binomial mixed-effects model, implemented in the R package glmmADMB (v0.8.0) 68. This allows for the inclusion of random effects, but is limited as the zero-inflated part of the model has a constant estimation. However, based on Akaike Information Criterion (AIC) scores it provided a better fit than using negative binomial error without zero inflation (1089.2 versus 1109.6). We included replicate, selection treatment (Up, Down and Control), male size class (small and large) and the treatment × size interaction as fixed effects, and pond identity (29 levels) and line identity (9 levels) as random effects. We included replicate as a fixed rather than random effect, because it only has three levels. In addition, line identity should already account for most of the relevant variation (Table 4). In a second analysis, we treated paternity success as a binary variable (that is, did a male sire any offspring) in a generalized linear mixed model with a binomial error structure and a logit link function, using the same fixed and random effects (Table 4). There was a significant effect of replicate on the likelihood of siring any offspring. Fewer males gained paternity success in replicate A compared with B or C (Wald's χ² = 17.843, df = 2, P = 0.0001; Table 4). This is likely to be due to the shorter time allowed for mating in replicate A. The data for male reproductive success are provided in Supplementary Data 5.
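To make the AIC comparison reported above concrete, the sketch below computes the AIC difference between the two models and the relative likelihood of the poorer model, exp(-ΔAIC/2). The relative-likelihood step is a standard way of reading an AIC difference and is added here for illustration; the text reports only the two AIC scores. The function name is hypothetical.

```python
from math import exp


def compare_aic(aic_a, aic_b):
    """Return (delta_aic, relative_likelihood) for two candidate models.

    delta_aic is the absolute difference in AIC; relative_likelihood is
    exp(-delta_aic / 2), the support for the higher-AIC model relative
    to the lower-AIC model.
    """
    delta = abs(aic_a - aic_b)
    return delta, exp(-delta / 2.0)


# AIC scores reported in the text: with versus without zero inflation.
delta_aic, rel_likelihood = compare_aic(1089.2, 1109.6)
# delta_aic is about 20.4 in favour of the zero-inflated model.
```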
Body and gonopodium tip shape. We photographed the left side of anaesthetized fish and digitized ten landmarks per image ( Supplementary Fig. 6) for geometric morphometric analysis (see Supplementary Methods). Gonopodia were photographed separately and positioned swung forward to view the distal tip. We measured the length of the entire gonopodium using ImageJ and digitized six landmarks ( Supplementary Fig. 6) on the distal portion of the gonopodium, which is the only part inserted into the female. We found vectors that described variation in male body and gonopodium shape, and female body shape using the R package geomorph (v2.1.5) 44 . The analysis provides Procrustes coordinates and centroid size for each specimen. To aid visualization and describe variation among fish, we found the axes of major shape variation using principal components analysis of the Procrustes coordinates ( Supplementary Figs 7-9). We constructed LMs to determine whether male body shape, female body shape and gonopodium tip shape changed with artificial selection on gonopodium length. In each model the response variable was the two-dimensional set of coordinates that had been aligned using the generalized Procrustes analysis. Selection treatment (Up, Down and Control) and line identity within selection regime were included as factors. We included centroid size as a covariate to control for size-related shape changes (that is, static allometry). Models were run using the procD.lm function in geomorph (Supplementary Table 3). This performs Procrustes MANOVA with random permutation, to quantify the relative amount of shape variation that is attributable to predictor variables 44 . We then performed post-hoc pairwise comparisons of Procrustes distances between least squares means, to determine which selection treatments differed significantly from each other 69 (Supplementary  Table 4). 
The data and R scripts for male and female body shape are provided in Supplementary Data 6-13 and the data and R scripts for gonopodium tip shape are provided in Supplementary Data 14-17.
Fecundity. In generation 8, for each of the 9 experimental lines we created 40 within-line pairs and recorded the number of offspring produced in the first brood and, if there was one, a second brood. Females were allowed 7-11 weeks (depending on the replicate) to produce offspring before being killed. We recorded the female SL. We compared the number of offspring produced (in the first brood and in total) using zero-inflated negative binomial mixed-effects models in glmmADMB, with replicate, selection treatment and female SL as fixed effects. Line identity was a random effect. We also treated breeding success as a binary variable in a model with binomial error and a logit link, using the same fixed and random effects. The mean number of offspring in the first brood and in total was 7.08 ± 0.53 and 9.28 ± 0.70 for Up, 6.58 ± 0. 49 (Tables 3 and 5, and Supplementary  Table 5). The data for within-line pair fecundity are provided in Supplementary Data 18.
Statistics. All statistical tests were conducted using R v3.2.2 (ref. 70) or SPSS 22.0 (IBM Corp., 2013). We used an α (significance) level of 0.05 and two-tailed tests. In general, where possible we included line identity as a random factor, because fish from the same selection line are not independent. We include replicate as a fixed effect, as there were too few levels to treat it as a random effect (Bolker et al. 71 recommend at least five to six levels for random effects). Some experimental and artificial selection studies treat only replicate as a random (or fixed) effect and exclude line identity, artificially inflating the sample size. We take a conservative approach and include line identity as a random effect. We present significance tests for fixed effects based on parameter estimates from the full model (that is, t = mean/s.e.) and by comparing the full model with one without the term of interest using the Anova function in the R package car, which provides Wald's χ²-tests (type III sum of squares) for generalized linear mixed models and F tests for general LMs. Unless otherwise stated, summary statistics and parameters estimated are given as mean ± s.e.
Data availability. Raw data files and R scripts used to generate the reported test statistics and summary statistics are provided as Supplementary Data 1