Introduction

The resemblance between relatives, or the observation that like begets like, has been noted since time immemorial, and the foundations of quantitative genetics were first laid when Fisher (1918) deduced the correlation between relatives on the assumption that the unknown loci controlling a quantitative trait behave in a Mendelian fashion. Two individuals are related if they have an ancestor in common. At any locus that affects a given trait, related individuals could inherit copies of the same allele found in the common ancestor. These copies of the same ancestral allele, which are described as being identical by descent, then make the relatives resemble each other.

Nonrelatives are individuals that are known or assumed to have no common ancestors in their respective pedigrees. Alleles shared by nonrelatives are alike in state (i.e., physically the same allele) but are not identical by descent. Identity by descent can be assessed in a probabilistic manner from pedigrees. For example, the probability of identity by descent of alleles is 1/4 between a parent and its offspring, 1/4 between full sibs, and 1/8 between half sibs (Falconer 1960). In contrast, prior to the availability of molecular markers, there had been no way to assess the extent to which nonrelatives have alleles in common. As such, the covariance between nonrelatives has been implicitly assumed to be zero in classical quantitative genetics. But is it always really zero?

If markers are in linkage disequilibrium with causal variants for a given trait, regressing the phenotypic covariance between nonrelatives on marker similarity (Ritland 1996; Lynch 1999) may reveal a nonzero covariance between nonrelatives. For complex traits in humans, the covariance among nonrelatives or distantly related individuals has been estimated in the context of genomewide prediction (de los Campos et al. 2013) or in resolving the issue of missing heritability in genomewide association studies (Yang et al. 2017). The results from these aforementioned studies in humans have indicated the presence of a covariance among nonrelatives.

Linkage disequilibrium is likely stronger in plants than in humans, yet empirical information is unavailable on the covariance between nonrelatives in crops. If a covariance among nonrelatives is frequent and strong enough, such a covariance would need to be considered when coupling modern genomics tools with quantitative genetics theory in plant breeding (Bernardo 2020). Furthermore, having a covariance between nonrelatives and between relatives would suggest the need for a broader theoretical framework for the covariance between individuals in general. My objective in this study was to determine if a covariance between nonrelatives is present for different traits in groups of unrelated maize lines.

Materials and methods

Maize lines, marker data, and phenotypic data

Phenotypic and single nucleotide polymorphism (SNP) marker data were from previous experiments conducted at the University of Minnesota (Schaefer and Bernardo 2013a, 2013b). The maize germplasm included 272 publicly developed lines and private lines whose US Plant Variety Protection certificates had expired. The lines were evaluated for anthesis date (growing degree days from planting to when 50% of the plants were shedding pollen), silking date (growing degree days from planting to when 50% of the plants had exposed silks), plant height (distance in cm from the soil surface to the flag leaf), ear height (distance in cm from the soil surface to the ear leaf node), kernel oil concentration (g kg−1), kernel protein concentration (g kg−1), and kernel starch concentration (g kg−1). Phenotyping was done in six location-year combinations in Minnesota in 2011 and 2012 (Schaefer and Bernardo 2013b). The phenotypic data analyzed herein were the least-squares means of each line for each trait across all six environments.

The lines were genotyped at 56,110 SNP loci on the Maize SNP50 BeadChip developed by Illumina (San Diego, California). Procedures for marker analysis of population structure were described by Schaefer and Bernardo (2013a) but are repeated here for convenience. Marker loci with a minor allele frequency less than 7% or with more than 10% missing data were disregarded, leading to 43,252 SNP loci. STRUCTURE software (Pritchard et al. 2000) was used to assess population structure among the lines. To help meet the assumption in STRUCTURE that the marker loci are in linkage equilibrium within subpopulations, a random subset of 3000 SNP loci was used in the model-based cluster analysis. This prior analysis by Schaefer and Bernardo (2013a) led to membership probabilities of each line for the following five subpopulations: A321 (Minnesota 13), B73 (Iowa Stiff Stalk Synthetic), Mo17, Oh43, and PH207 (Iodent) (Troyer 1999). There were 109 lines in the A321 subpopulation, 61 in the B73 subpopulation, 29 in the Mo17 subpopulation, 45 in the Oh43 subpopulation, and 28 in the PH207 subpopulation. For convenience, the five subpopulations are used herein to indicate line membership, so that ‘a B73 line’ refers to a line with primary membership in the B73 group rather than the line B73 itself.

Pairwise similarity and identifying pairs of unrelated lines

A set of 28,626 SNP loci with reduced multicollinearity was previously identified (Schaefer and Bernardo 2013b) via PLINK software (Purcell et al. 2007), according to a linkage disequilibrium maximum threshold of r² = 0.9 within a sliding window of 50 markers. The 28,626 SNP loci were then used for calculating the pairwise similarity among the lines. The numbers of SNP loci on each of the 10 maize chromosomes (in parentheses) were as follows: (1) 4182, (2) 1972, (3) 3618, (4) 3343, (5) 3367, (6) 2518, (7) 2493, (8) 2682, (9) 2316, and (10) 2135. Marker similarity between each of the 36,856 pairs of lines was calculated on an allelic basis. At a given SNP locus with alleles M and m, similarity was 1.0 between lines that both had the MM genotype; 1.0 between lines that both had the mm genotype; 0 between an MM line and mm line; and 0.5 when one or both lines had the Mm genotype (i.e., Mm versus Mm, Mm versus MM, or Mm versus mm). The within-locus similarity was summed across loci and divided by the total number of SNP loci.
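
The allelic similarity rules above can be sketched in a few lines of Python. This is only an illustration, not the software used in the study: the 0/1/2 genotype coding, the function names, and the example genotypes are assumptions, and missing genotypes are assumed to be absent.

```python
def locus_similarity(g1, g2):
    """Allelic similarity at one SNP locus.

    Genotypes are coded as the number of copies of allele M
    (2 = MM, 1 = Mm, 0 = mm); this coding is an assumption.
    """
    if g1 == 1 or g2 == 1:
        # One or both lines are heterozygous (Mm).
        return 0.5
    # Both lines are homozygous: identical or opposite genotypes.
    return 1.0 if g1 == g2 else 0.0


def marker_similarity(line_i, line_j):
    """Within-locus similarity summed across loci, divided by the number of loci."""
    if len(line_i) != len(line_j):
        raise ValueError("both lines must be scored at the same loci")
    total = sum(locus_similarity(a, b) for a, b in zip(line_i, line_j))
    return total / len(line_i)


# Hypothetical genotypes at five loci (not data from the study).
line_a = [2, 2, 0, 1, 0]
line_b = [2, 0, 0, 2, 1]
s_ab = marker_similarity(line_a, line_b)
print(s_ab)
```

In this toy case the five within-locus similarities are 1, 0, 1, 0.5, and 0.5, which gives a pairwise similarity of 3/5.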

Unrelated lines were identified in the following manner. Consider lines i and j and subpopulation k, and assume that pik was the membership probability of line i in subpopulation k whereas pjk was the membership probability of line j in subpopulation k. The pikpjk value was calculated for each of the five subpopulations. Lines i and j were then considered unrelated when pikpjk did not exceed 0.01 for any of the five subpopulations. In other words, pairs of unrelated lines were identified by finding those with membership-probability products that were less than 1% for all subpopulations. As indicated in the Discussion, this was likely a conservative approach for finding pairs of unrelated lines. Unrelated lines were identified across all subpopulations as well as for each of the 10 pairwise combinations of the five subpopulations (e.g., A321 lines versus B73 lines, A321 lines versus Mo17 lines, … Oh43 lines versus PH207 lines).
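
The screening rule can be illustrated with a short Python sketch. The line names and membership probabilities below are hypothetical, and the function name is an assumption; the 0.01 threshold is the one stated above.

```python
from itertools import combinations


def is_unrelated(p_i, p_j, threshold=0.01):
    """True when the product of membership probabilities does not
    exceed the threshold in any subpopulation."""
    return all(a * b <= threshold for a, b in zip(p_i, p_j))


# Hypothetical membership probabilities in five subpopulations
# (each row sums to 1); these are not values from the study.
membership = {
    "lineA": [0.980, 0.005, 0.005, 0.005, 0.005],
    "lineB": [0.004, 0.980, 0.006, 0.005, 0.005],
    "lineC": [0.500, 0.480, 0.010, 0.005, 0.005],
}

unrelated_pairs = [
    (i, j)
    for i, j in combinations(sorted(membership), 2)
    if is_unrelated(membership[i], membership[j])
]
print(unrelated_pairs)
```

Here lineC is admixed between the first two subpopulations, so its product with lineA (0.98 × 0.50) and with lineB (0.98 × 0.48) exceeds the threshold, and only lineA and lineB are retained as an unrelated pair.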

Estimating the covariance between nonrelatives

The presence of a covariance between nonrelatives was assessed by regressing the cross products between trait values of nonrelatives on the marker similarity between the nonrelatives. This procedure required the absence of a nongenetic covariance between lines, and this requirement was met by randomization of the lines within each location in which phenotyping was done. Suppose line i was primarily a member of subpopulation k. For a given trait, the mean of line i was first modeled as Yi(k) = µ + vk + gi + error, where µ was the overall mean, vk was the effect of subpopulation k (Yu et al. 2006), and gi was the effect of line i. The Yi(k) value was corrected for the overall mean and subpopulation effect, i.e., yi(k) = Yi(k) – \(\left( {\hat \mu + \hat v_k} \right)\).

When unrelated lines were considered while ignoring their subpopulation memberships, the cross product for unrelated lines i (in subpopulation k) and j (in subpopulation k’) was calculated as Cij = \(\left[ {y_{i\left( k \right)} - \bar y} \right]\left[ {y_{j\left( {k^{\prime}} \right)} - \bar y} \right]\), where \(\bar y\) was the mean corrected value of all the lines used in calculating the Cij values. When pairs of unrelated lines from specific subpopulations were considered (e.g., an A321 line and an unrelated B73 line), the cross product for unrelated lines i and j was calculated as Cij = \(\left[ {y_{i\left( k \right)} - \bar y_k} \right]\left[ {y_{j\left( {k^{\prime}} \right)} - \bar y_{k^{\prime}}} \right]\), where \(\bar y_k\) was the mean corrected value of the set of subpopulation k lines that were included in calculating the Cij values, and \(\bar y_{k^{\prime}}\) was the mean corrected value of the set of subpopulation k’ lines that were included in calculating the Cij values.
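
The two ways of centering the corrected values can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the example values are hypothetical, and each line is assumed to contribute once to the mean regardless of how many pairs it appears in.

```python
from statistics import mean


def cross_products(y, pairs, group=None):
    """Cross products C_ij for pairs of unrelated lines.

    y     : dict mapping line name to its corrected trait value
    pairs : list of (line_i, line_j) tuples
    group : optional dict mapping line name to subpopulation label.
            If None, values are centered on the pooled mean of all lines
            in the pairs; otherwise each value is centered on the mean of
            its own subpopulation's lines that appear in the pairs.
    """
    lines = sorted({line for pair in pairs for line in pair})
    if group is None:
        pooled = mean(y[line] for line in lines)
        center = {line: pooled for line in lines}
    else:
        by_group = {}
        for line in lines:
            by_group.setdefault(group[line], []).append(y[line])
        center = {line: mean(by_group[group[line]]) for line in lines}
    return {(i, j): (y[i] - center[i]) * (y[j] - center[j]) for i, j in pairs}


# Hypothetical corrected values for four lines in two subpopulations.
y = {"a1": 2.0, "a2": -1.0, "b1": 1.0, "b2": 4.0}
group = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
pairs = [("a1", "b1"), ("a2", "b2")]

c_pooled = cross_products(y, pairs)           # subpopulation membership ignored
c_by_group = cross_products(y, pairs, group)  # specific pair of subpopulations
print(c_pooled)
print(c_by_group)
```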

The covariance between nonrelatives was then assessed via the regression of Cij on Sij, where Sij was the marker similarity between unrelated lines i and j (Ritland 1996; Lynch 1999). For convenience and ease of interpretation, the Sij values were converted to percentages and expressed as a deviation from the mean. The regression of Cij on Sij is equal to the covariance only when the Sij values are standardized, and the initial analysis involved standardizing the Sij values. Such analysis proved less informative than the regression of Cij on nonstandardized Sij values: when comparing results across different subpopulations, it was helpful to assess the change in Cij per percentage-point change (rather than per standardized-unit change) in Sij. Hence, the results reported herein are regressions instead of covariances but the two are sometimes used interchangeably given the objective of determining whether a covariance (as reflected by the regression of Cij on Sij) between nonrelatives is present in maize.
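
As a rough sketch of this step, the following Python code regresses Cij on Sij after converting Sij to percentage points and centering it on its mean. Ordinary least squares is assumed here, and the function name and the input values are hypothetical.

```python
from statistics import mean


def regress_c_on_s(c, s):
    """Slope and correlation for the regression of C_ij on S_ij.

    s is given as a proportion (0 to 1); it is converted to a percentage
    and expressed as a deviation from its mean, so the slope is the change
    in C_ij per percentage-point change in S_ij.
    """
    s_pct = [100.0 * value for value in s]
    s_bar = mean(s_pct)
    x = [value - s_bar for value in s_pct]
    c_bar = mean(c)
    sxx = sum(xv * xv for xv in x)
    syy = sum((cv - c_bar) ** 2 for cv in c)
    sxy = sum(xv * (cv - c_bar) for xv, cv in zip(x, c))
    slope = sxy / sxx
    correlation = sxy / (sxx * syy) ** 0.5
    return slope, correlation


# Hypothetical similarities and cross products for four pairs of lines.
s_values = [0.50, 0.52, 0.54, 0.56]
c_values = [1.0, 3.0, 5.0, 7.0]
slope, correlation = regress_c_on_s(c_values, s_values)
print(round(slope, 6), round(correlation, 6))
```

Centering Sij does not change the slope; it only moves the intercept to the mean of Cij, which is why the conversion is a matter of convenience rather than substance.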

The p-values for the regression coefficients were calculated via z-tests and a false discovery rate of 0.05 was imposed for the multiple comparisons made (Benjamini and Hochberg 1995). In addition to regressing Cij on the across-genome Sij, the regression of Cij on per-chromosome Sij was calculated. This analysis was conducted to assess if any significant regression was due to similarity across most or all chromosomes, or was due to similarity on specific chromosomes.
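
For readers unfamiliar with the Benjamini and Hochberg (1995) procedure, a minimal step-up implementation is sketched below. The function name and the p-values are hypothetical and are not results from this study.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking which tests are declared
    significant by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])
    # Find the largest rank k such that p_(k) <= k * fdr / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / m:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is declared significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


# Hypothetical p-values for five regressions.
example_a = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20])
# A case in which a later rank passes and carries earlier ranks with it.
example_b = benjamini_hochberg([0.01, 0.04, 0.03, 0.035])
print(example_a)
print(example_b)
```

The second example shows the step-up property: the second-smallest p-value (0.03) exceeds its own threshold of 0.025, yet it is declared significant because larger-ranked p-values fall below theirs.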

For reference purposes, the covariance between relatives was also assessed. Most pairs of lines within a given subpopulation were expected to be related, and the regression of Cij on Sij was calculated for all pairs of lines within each of the five subpopulations. A false discovery rate of 0.05 was imposed on the within-subpopulation regression coefficients (Benjamini and Hochberg 1995).

Results

Among the 272 maize lines, there were 5278 pairs of unrelated lines across the five subpopulations (Table 1). When the analysis was restricted to unrelated lines between two subpopulations, the smallest number of pairs of unrelated lines was 33 for the A321 and Oh43 groups; these 33 pairs of unrelated lines involved only seven A321 lines and five Oh43 lines. The largest number of pairs of unrelated lines was between the A321 and B73 groups, with 1214 unrelated pairs that involved 52 A321 lines and 27 B73 lines. The number of pairs of related lines ranged from 378 within the PH207 group to 5886 within the A321 group (Table 1).

Table 1 Marker similarity, calculated from 28,626 single nucleotide polymorphism loci, among related lines and among unrelated lines in maize.

The marker similarity among the 5278 pairs of unrelated lines (across all subpopulations) ranged from 0.435 to 0.646 and had a mean of 0.586 (Table 1). The mean similarity was lowest (0.529) between unrelated B73 and Mo17 lines and was highest (0.622) between unrelated A321 and Oh43 lines. The range in similarity was widest between unrelated B73 and Mo17 lines (0.435 to 0.619) and was narrowest between unrelated A321 and Oh43 lines (0.610 to 0.637). Marker similarity was higher between related lines (mean of 0.656) than between unrelated lines (mean of 0.586) (Table 1). Within-group similarity was highest among B73 lines (mean of 0.729), and the highest similarity (0.980) was between B73 itself and line F42 within the B73 subpopulation.

With seven traits and 10 pairs of subpopulations plus the overall set of unrelated lines, there were 77 regressions of cross products between unrelated lines (Cij) on marker similarity (Sij). Out of the 77 regressions, the following seven were significant at a false discovery rate of 0.05: anthesis and silking dates in unrelated B73 and Oh43 lines; plant height and ear height in unrelated Oh43 and PH207 lines; oil in unrelated A321 and Mo17 lines; protein in unrelated B73 and Mo17 lines; and starch in unrelated B73 and PH207 lines (Fig. 1). The strongest covariance was for protein in unrelated B73 and Mo17 lines, for which the regression coefficient was −5.8 g kg−1 per percentage-point change in marker similarity and the correlation between Cij and Sij was −0.14. The second strongest covariance was for ear height in unrelated Oh43 and PH207 lines, for which the regression coefficient was 8.9 cm per percentage-point change in marker similarity and the correlation between Cij and Sij was 0.11. The weakest significant covariance was for starch in unrelated B73 and PH207 lines, for which the regression coefficient was 11.6 g kg−1 per percentage-point change in marker similarity and the correlation between Cij and Sij was 0.06.

Fig. 1

Seven instances of a significant covariance between nonrelatives in maize, as assessed by the regression of cross-products between nonrelatives (Y-axis) on across-genome marker similarity.

For protein in unrelated B73 and Mo17 lines, the regression coefficients of Cij on per-chromosome Sij had low p-values across all 10 chromosomes (Table 2). In contrast, for ear height in unrelated Oh43 and PH207 lines, the p-values for regression coefficients were 0.02–0.07 for chromosomes 1, 2, 6, 7, and 8 but were 0.19–0.34 for chromosomes 3, 4, 5, 9, and 10.

Table 2 Per-chromosome regression of nonrelative cross-products on marker similarity for protein in unrelated B73 and Mo17 lines and for ear height in unrelated Oh43 and PH207 lines.

The overall covariance between relatives was significant (P = 0.05) for all traits except ear height (Table 3). Within the A321 group, the covariance between relatives was significant for all seven traits. But within the Mo17 group, none of the regression coefficients of Cij on Sij was significant. Within the B73 group, the regression coefficients were nonsignificant for ear height and oil but were significant for the five other traits (Table 3, Fig. 2). Within-group covariances were significant for the two flowering-date traits in all subpopulations except Mo17. Regression coefficients were much smaller for oil than for protein and starch.

Table 3 Regression of relative cross-products on marker similarity among related lines for different traits in maize.
Fig. 2: Regression of relative cross-products on across-genome marker similarity for different traits in the B73 group.

Results are not shown for oil concentration, for which the regression was near zero and nonsignificant and the scatterplot was similar to that for ear height.

Discussion

Weak covariance between nonrelatives for certain traits and subpopulations

The results showed that in maize, a covariance between nonrelatives is not ubiquitous but is sometimes present for specific traits and for certain groups of unrelated individuals. Such a covariance was generally weak and most often positive, yet a negative covariance between nonrelatives is also possible when it is assessed via a random set of genomewide markers. The covariance among nonrelatives was due to similarity across the maize genome in some cases and to similarity at specific chromosomes in other cases. While the results were from a single panel of maize lines, the lines represented the key germplasm groups in US maize (Troyer 1999, Schaefer and Bernardo 2013a) and they were enough to show that a covariance between nonrelatives is sometimes present in maize. The presence of a covariance among nonrelatives challenges a long-held, implicit assumption in quantitative genetics and requires a rethinking of how the covariance between individuals of varying levels of relatedness is best expressed.

In this study, unrelated lines were identified not from pedigree records (which were often incomplete) but on the basis of having a joint probability of membership less than 0.01 for each of the five subpopulation assignments given by Schaefer and Bernardo (2013a). Some pairs of lines that were unrelated by pedigree were excluded from the sets of unrelated lines. For example, the French lines F2 and F7 were both developed by self-pollination from the Lacaune landrace (Tenaillon and Charcosset 2011) and, according to their pedigrees, are unrelated to the US line B73. The membership probabilities in the B73 subpopulation were 1.0 for B73, 0.03 for F2, and 0.08 for F7. Given that the product of their membership probabilities for the B73 subpopulation exceeded 0.01, F2 and F7 were excluded from the set of lines unrelated to B73 even though their pedigree records indicated otherwise.

The foregoing point suggested that the assessment of the covariance between nonrelatives in this study may have been too conservative. For example, the nonzero membership probabilities of F2 and F7 in the B73 subpopulation indicated some level of alikeness in state with B73. If F2, F7, and other lines unrelated by pedigree were included among the lines unrelated to B73, the range in Sij values among the resulting unrelated lines could have increased. Such expansion in the range of the x-axis could then have led to a stronger regression coefficient. This conservative criterion aside, a key point worth considering is the very definition of two individuals being unrelated.

In particular, the results herein underscored how the definition of relatedness differs from an evolutionary genetics perspective versus a quantitative genetics perspective. If a particular crop species emerged from only one domestication event, as might have been the case with maize (Matsuoka et al. 2002) and rice (Oryza sativa L.) (Molina et al. 2011; Huang et al. 2012), then all individual plants of that species would be considered as related because of their singular ancestry. On the other hand, inferences in classical quantitative genetics rely on having a real or conceptual population that is in Hardy-Weinberg equilibrium and in which the individuals are assumed non-inbred and unrelated (Falconer 1960). If the common ancestry was in the distant past, then the probability of identity by descent between two individuals will be close to zero and will have no meaningful contribution to the covariance between relatives. Relatedness in this study was viewed according to this quantitative genetics perspective.

The results indicated that a covariance between nonrelatives can be detected even when the range in Sij values among unrelated lines is small. For example, the regression of Cij for plant height and ear height on Sij between unrelated Oh43 and PH207 lines was significant despite the range in Sij values being only 0.035 (Table 1). On the other hand, the wide range (0.184) and low mean (0.529) of Sij values between unrelated B73 and Mo17 lines reflected an ascertainment bias in the development of the Maize SNP50 BeadChip. This Illumina SNP chip was developed using B73 as the reference genome, and heterozygosity of the SNP markers in the B73 × Mo17 cross was a criterion used to evaluate the utility of the SNP chip (Illumina 2012). It was therefore unsurprising that the mean similarity between unrelated B73 lines and Mo17 lines was lower than the mean similarity for the other pairs of subpopulations.

As expected, most of the covariances among relatives were significant, with the notable exceptions of the Mo17 subpopulation, for which the regression of Cij on Sij was nonsignificant for all traits, and the PH207 subpopulation, for which the regression was nonsignificant for ear height and the three kernel composition traits (Table 3). The Mo17 and PH207 subpopulations had the fewest lines and they also tended to have the narrowest ranges in line means for the traits studied (Schaefer and Bernardo 2013b).

Implications

Several previous studies that implicitly recognized a covariance between nonrelatives focused on how common causal variants contribute to a correlation between nonrelatives. For prediction of height among unrelated human subjects, the use of markers with the lowest p-values (from genomewide association analysis) led to higher prediction accuracies compared with using all of the markers (de los Campos et al. 2013). Multiple genomewide association studies (summarized by Visscher 2008) for height in humans have involved estimating heritability using markers with significant effects in different human populations. The current study differed from these aforementioned studies in that no attempt was made to estimate the similarity among unrelated individuals from a subset of markers with low p-values. That being said, the importance of specific chromosomes with regard to the covariance among nonrelatives was evident in the regression of Cij on per-chromosome Sij (Table 2).

Previous analyses involving the regression of Cij on Sij in unrelated human subjects (Kemper et al. 2021) equated the regression coefficient to the additive genetic variance (Ritland 1996; Lynch 1999). In contrast, the regression coefficients in this study were used to assess covariances but not genetic variances which, unlike covariances, are positive by definition. The negative regression of Cij for protein on Sij in unrelated B73 and Mo17 lines (Fig. 1) was a unique result that indicated that, unlike the covariance between relatives which is always positive, the covariance between nonrelatives can be negative when it is assessed via genomewide markers.

The negative covariance for protein in unrelated B73 and Mo17 lines was likely due to differences in linkage phases between the two subpopulations; this same phenomenon has been recognized as a reason for a low accuracy of genomewide prediction between different populations (de Roos et al. 2009). Suppose the marker alleles at a locus are denoted by M and m whereas the causal alleles are denoted by Q and q. Furthermore, suppose that most of the gametes are in coupling phase in one subpopulation (MQ and mq) whereas most of the gametes are in repulsion phase in the second subpopulation (Mq and mQ). In this situation, similarity at the marker locus would be associated with dissimilarity at the causal locus, thus leading to a negative covariance. Thus, while any covariance among nonrelatives should be positive when similarity is directly assessed at the causal loci themselves, a negative covariance may arise if genomewide markers are used as proxies for the unknown causal variants.
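
The linkage-phase argument can be made concrete with a deliberately extreme toy example in Python. Everything here is hypothetical: the phases are assumed to be complete (only coupling gametes in the first subpopulation and only repulsion gametes in the second), each inbred line is represented by a single haplotype, and the causal alleles are given arbitrary values of +1 and −1.

```python
# Inbred lines are represented by one (marker allele, causal allele) haplotype.
coupling_lines = [("M", "Q"), ("m", "q")]   # first subpopulation
repulsion_lines = [("M", "q"), ("m", "Q")]  # second subpopulation

# Hypothetical genotypic values, as deviations from each subpopulation mean.
value = {"Q": 1.0, "q": -1.0}

similarity = []      # marker similarity S_ij for each between-subpopulation pair
cross_product = []   # cross product C_ij of the causal-allele values
for marker_i, causal_i in coupling_lines:
    for marker_j, causal_j in repulsion_lines:
        similarity.append(1.0 if marker_i == marker_j else 0.0)
        cross_product.append(value[causal_i] * value[causal_j])

# Ordinary least-squares slope of C_ij on S_ij.
s_bar = sum(similarity) / len(similarity)
c_bar = sum(cross_product) / len(cross_product)
sxy = sum((s - s_bar) * (c - c_bar) for s, c in zip(similarity, cross_product))
sxx = sum((s - s_bar) ** 2 for s in similarity)
slope = sxy / sxx
print(slope)
```

Pairs that match at the marker always differ at the causal locus in this setup, so their cross products are negative, whereas pairs that differ at the marker share the causal allele; the resulting slope is negative.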

This study focused on the initial step of determining whether a covariance between nonrelatives exists in maize, and investigations of the practical significance of this finding are deferred to future studies. Because the covariances between nonrelatives in this study were mostly weak, they could probably be safely ignored without much practical consequence. Such an approach is currently being used in genomewide prediction, for which the typical procedure is to either (1) capture identity by descent and exclude non-identity by descent using markers (Bernardo 1993) or (2) include both identity by descent and alikeness in state without making a distinction between the two (de los Campos et al. 2013, Lorenz and Smith 2015). However, if the covariance between nonrelatives is substantial (e.g., negative covariance for protein in unrelated B73 and Mo17 lines in Fig. 1), it might be advantageous to explicitly account for it when expressing the covariance between individuals.

At a single locus, the expectation of Sij is fij + θij, where fij is the probability that a random allele from i and a random allele from j are identical by descent (i.e., coefficient of coancestry), and θij is the probability that a random allele from i and a random allele from j are alike in state, given that they are not identical by descent (Cox et al. 1985; Lynch 1988). In other words, unless unrelated individuals do not share any marker alleles, Sij includes a portion due to relatedness (fij) and a portion due to nonrelatedness (θij). This point also implies that whereas the covariance between nonrelatives was studied herein in isolation by focusing only on unrelated lines, a covariance between nonrelatives can play a role even among related lines. If dominance is absent, the covariance due to identity by descent is equal to 2fijVA, where VA is the additive genetic variance (Falconer 1960). In an analogous manner, the portion of the covariance due to nonrelatedness can be defined as 2θijCovU, where CovU is the covariance between nonrelatives.

Suppose three pairs of lines all have Sij = 0.75. If, in accordance with current practice, no distinction is made between the covariances due to relatedness and due to nonrelatedness, the value of the covariance will be identical for these three pairs of lines. In contrast, suppose a distinction is made between relatedness and nonrelatedness, and that the common Sij of 0.75 corresponds to (1) fij = 0.15 and θij = 0.60 for the first pair of lines, (2) fij = 0.20 and θij = 0.55 for the second pair of lines, and (3) fij = 0.25 and θij = 0.50 for the third pair of lines. Furthermore, suppose that VA (100) is larger than CovU (10). In this hypothetical example, the covariances between lines (calculated as 2fijVA + 2θijCovU) are now unequal and are 42 for the first pair, 51 for the second pair, and 60 for the third pair.
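
The arithmetic in this hypothetical example can be checked with a few lines of Python. The function name is an assumption; the values of fij, θij, VA, and CovU are those given in the paragraph above.

```python
def covariance_between_lines(f, theta, v_a, cov_u):
    """Covariance partitioned as 2 * f * V_A + 2 * theta * Cov_U."""
    return 2 * f * v_a + 2 * theta * cov_u


V_A = 100.0    # additive genetic variance in the hypothetical example
COV_U = 10.0   # covariance between nonrelatives in the hypothetical example

# (f_ij, theta_ij) for the three pairs of lines; each pair has S_ij = 0.75.
pairs = [(0.15, 0.60), (0.20, 0.55), (0.25, 0.50)]

covariances = [
    round(covariance_between_lines(f, theta, V_A, COV_U), 6)
    for f, theta in pairs
]
print(covariances)
```

For the first pair, the two components are 2(0.15)(100) = 30 and 2(0.60)(10) = 12, which sum to 42; the same calculation gives 51 and 60 for the second and third pairs.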

Accounting for the covariances due to relatedness and nonrelatedness could therefore alter the estimated covariances between individuals and, consequently, affect genomewide prediction and other procedures that rely on the covariance between individuals. This topic, along with how to partition Sij into fij and θij, is the focus of a follow-up study.