Spouses’ faces are similar but do not become more similar with time

The widely disseminated convergence in physical appearance hypothesis posits that long-term partners’ facial appearance converges with time due to their shared environment, emotional mimicry, and synchronized activities. Although plausible, this hypothesis is incompatible with empirical findings pertaining to a wide range of other traits—such as personality, intelligence, attitudes, values, and well-being—in which partners show initial similarity but do not converge over time. We solve this conundrum by reexamining this hypothesis using the facial images of 517 couples taken at the beginning of their marriages and 20 to 69 years later. Using two independent methods of estimating their facial similarity (human judgment and a facial recognition algorithm), we show that while spouses’ faces tend to be similar at the beginning of marriage, they do not converge over time, bringing facial appearance in line with other personal characteristics.

[...] that appearance converges as a function of shared actions and environment, and emotional mimicry, should apply to other personal characteristics as well. How does one reconcile the convergence in facial appearance with the lack thereof in the context of virtually all other traits, such as interests, personality, intelligence, attitudes, values, and well-being? A closer look at the literature reveals that while the convergence in physical appearance hypothesis is one of the tenets of current psychological science and has been widely disseminated through textbooks 23, books 24,25, and landmark papers 26,27, it has virtually no empirical support. Zajonc et al.'s study 12, while elegantly designed, was based on an extremely small sample of 12 married heterosexual couples. Furthermore, its findings have never been replicated. Two other studies occasionally cited in support of facial convergence (Hinsz 28 and Griffith and Kunz 29) neither tested this hypothesis nor provided any support for it. Both studies presented evidence for facial homogamy, i.e., spouses' tendency to have similar faces, but provided no support for an increase in facial resemblance over time. Hinsz 28 found that romantic partners' faces were more similar than those of random pairs of men and women, yet couples married for 25 years were no more similar than recently engaged ones. Griffith and Kunz 29 showed that student raters could match spouses' faces at a level above chance, yet found "no significant trend in growing to look alike as persons live together as husband and wife" (p. 453).
In this work, we aim to validate the physical convergence hypothesis in a large sample (n = 517) of white married heterosexual couples (we were unable to find a large enough sample of homosexual and non-white couples to allow for a meaningful analysis). Two approaches to measuring facial similarity were used: human judges and a modern facial recognition algorithm. Both approaches showed that while spouses' faces were similar at the outset of their marriage, they did not converge over time.

Methods
The study was reviewed and approved by Stanford University's IRB. All methods were carried out in accordance with relevant guidelines and regulations. The preregistration documents can be found at https://aspredicted.org/2fh78.pdf. The Supplementary Information contains the list of and rationale for the post-registration changes to the study design. The materials, data, and code used to compute the results are available at https://osf.io/ekwm7.

Facial images.
The facial images of 517 couples were collected from public online sources: 392 newspaper wedding anniversary announcements downloaded from https://www.newspaperarchive.com, 102 Google Search results, and 23 public profiles from Ancestry.com (a genealogy website). Two facial images of each spouse were collected: one taken within 2 years of the wedding, and one taken 20 to 69 years later (the marriage dates and the dates on which the photos were taken were extracted from their captions; the average marriage length was 49 years).
Images were processed using Face++ (https://www.faceplusplus.com), a widely used facial recognition software, to detect facial outlines and head orientation, and to approximate individuals' age (see Supplementary Fig. S1 for age distribution). We only included images containing faces larger than 120 × 120 pixels and with an absolute value of yaw and pitch below 55° and 24°, respectively. The images were converted into grayscale and cropped around the face to remove the background and non-facial details. Their brightness was corrected using the "auto-adjust colors" function in IrfanView 4.5. The faces were rotated to the vertical position and resized to 224 × 224 pixels (see Fig. 1).
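The inclusion criteria above can be expressed as a simple filter. The following is a minimal sketch, not the authors' code: the `DetectedFace` record and its field names are hypothetical stand-ins for whatever a face detector returns, and only the three thresholds (120 pixels, 55° yaw, 24° pitch) come from the text.

```python
from dataclasses import dataclass

# Thresholds stated in the text; everything else here is illustrative.
MIN_FACE_SIDE_PX = 120
MAX_ABS_YAW_DEG = 55.0
MAX_ABS_PITCH_DEG = 24.0


@dataclass
class DetectedFace:
    """Hypothetical record of one detected face (field names are assumptions)."""
    width_px: int
    height_px: int
    yaw_deg: float
    pitch_deg: float


def passes_inclusion(face: DetectedFace) -> bool:
    """Return True if the face is larger than 120 x 120 px and close enough to frontal."""
    big_enough = (face.width_px > MIN_FACE_SIDE_PX
                  and face.height_px > MIN_FACE_SIDE_PX)
    frontal_enough = (abs(face.yaw_deg) < MAX_ABS_YAW_DEG
                      and abs(face.pitch_deg) < MAX_ABS_PITCH_DEG)
    return big_enough and frontal_enough
```

The sketch reads "larger than" and "below" as strict inequalities, so a face of exactly 120 pixels on one side would be excluded; the source does not say how boundary cases were handled.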

Stimulus sets.
Faces were arranged into 2068 unique stimulus sets (517 couples × two spouses × two time points: at the beginning of the marriage and 20 to 69 years later) following the procedure from Zajonc et al. 12. Each face (target) was matched with faces of six other people of the opposite sex (alternatives): the target's spouse and five random others from our dataset. To control for the effect of age and eyewear, the alternatives had the same eyewear status (glasses or no glasses) and similar approximated age (± 5 years) as the spouse. An example stimulus set is presented in Fig. 1.
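The matching procedure described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the study's code: the `Face` record, its field names, and the function `build_stimulus_set` are hypothetical, and the pool is assumed to contain only faces from the same time point as the target.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Face:
    """Hypothetical face record; field names are assumptions for illustration."""
    person_id: str
    sex: str          # e.g. "F" or "M"
    age: float        # approximated age in years
    glasses: bool     # eyewear status


def build_stimulus_set(target, spouse, pool, rng, n_random=5, age_window=5.0):
    """Return six alternatives for `target`: the spouse plus five random others.

    The random others share the spouse's sex and eyewear status and have an
    approximated age within +/- `age_window` years of the spouse's age.
    """
    candidates = [
        face for face in pool
        if face.person_id not in (target.person_id, spouse.person_id)
        and face.sex == spouse.sex
        and face.glasses == spouse.glasses
        and abs(face.age - spouse.age) <= age_window
    ]
    if len(candidates) < n_random:
        raise ValueError("not enough matching faces in the pool")
    alternatives = rng.sample(candidates, n_random) + [spouse]
    rng.shuffle(alternatives)  # so the spouse's position carries no information
    return alternatives
```

Applying this once per spouse and per time point gives 517 × 2 × 2 = 2068 stimulus sets, matching the count reported in the text. Shuffling the display order is an assumption; the source does not describe how alternatives were ordered on screen.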

Human judges and rankings.
Judges (n = 153; from the U.S.), recruited on Amazon Mechanical Turk (AMT; an online crowdsourcing marketplace), were instructed to rank alternative faces from the most (1) to the least (6) closely resembling the target face (see Fig. 1). Ten rankings were obtained for each stimulus set. The spouses' perceived similarity at a given point in time (at the beginning of the marriage and later) was computed by averaging their ranks across the two stimulus sets pertaining to them (husband as a target and wife as a target). The resulting scale ranged from 1 (all judges perceived them as most similar) to 6 (all judges perceived them as least similar). If there was no link between being married and similarity, the spouses' average rank should equal 3.5. The use of a relative (i.e., ranking) rather than absolute (e.g., Likert scale) measure of facial similarity enabled controlling for the possibility that people's faces may generally become more (or less) similar with time as they age.
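As a worked illustration of the scoring described above, the sketch below averages the ranks that judges assigned to the spouse across a couple's two stimulus sets. The function name and input layout are assumptions made for this example; with ten rankings per set, pooling the twenty ranks gives the same value as averaging the two per-set means.

```python
from statistics import mean

# Expected spouse rank if six alternatives were ordered at random: (1 + 6) / 2.
CHANCE_RANK = (1 + 6) / 2


def perceived_similarity(ranks_husband_target, ranks_wife_target):
    """Average the spouse's rank over both stimulus sets of one couple.

    Each argument holds the ranks (1 = most similar, 6 = least similar) that
    individual judges gave to the spouse in one stimulus set.
    """
    return mean(list(ranks_husband_target) + list(ranks_wife_target))
```

For example, a couple whose spouse was ranked first by all ten judges in one set and second by all ten in the other would score 1.5, well below the chance value of 3.5.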
Additionally, following Zajonc et al.'s 12 original design, a separate sample of 117 judges recruited on AMT were asked to rank alternatives in terms of their likelihood to be married to the target (the same stimulus as presented in Fig. 1 was used, with "closely resembles" replaced with "likely to be married to").

Facial recognition algorithm and rankings.
An alternative set of results was produced using VGGFace2 30, a widely used facial recognition deep neural network that was shown to outperform humans in judging facial similarity 31. [...] images taken at different times, with different devices, from different angles, and in different circumstances, they tend to capture features that remain stable across age and context, such as facial morphology and complexion. They are as unaffected as possible by transient features such as aging, facial expression, head orientation, hairstyle, and image properties such as background and lighting 32. Consequently, they are well suited to the task of quantifying the similarity between faces while controlling, as much as possible, for transient features.
Following the standard procedure used in facial recognition, cropped facial images were converted into 2048-value-long face descriptors using VGGFace2 with the SE-ResNet-50 architecture, and the descriptors were L2-normalized. Next, for each stimulus set, the cosine similarity between the face vectors of the target and each alternative face was computed. The alternative faces were ranked from the most (1) to the least (6) closely resembling the target face (i.e., the same ranking scale as for human judges).
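The ranking step can be illustrated with a short sketch. It assumes the face descriptors have already been extracted (the network itself is not reproduced here) and uses plain Python lists in place of 2048-value vectors; the function names are chosen for this example only.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine similarity, computed as the dot product of L2-normalized vectors."""
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


def rank_alternatives(target, alternatives):
    """Rank alternatives from 1 (most similar to target) to N (least similar).

    Returns a list of ranks aligned with the order of `alternatives`.
    """
    sims = [cosine_similarity(target, alt) for alt in alternatives]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    ranks = [0] * len(sims)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks
```

With a toy target `[1, 0]` and alternatives `[0, 1]`, `[1, 0]`, and `[1, 1]`, the similarities are 0, 1, and about 0.71, so the ranks are `[3, 1, 2]`. Ties are broken by input order here, which is an arbitrary choice; exact ties are unlikely with real-valued descriptors.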

Statistical analyses.
The average similarity ranks of spouses' faces were compared with the chance value (3.5) using a one-sample two-tailed t-test to detect homogamy. Paired two-tailed t-tests were used to compare the similarity of spouses' faces at the beginning of marriage and later to detect the convergence in facial appearance. The average Kendall rank correlation between two randomly selected rankings for each stimulus set was used to measure inter-rater reliability.

Results
Figure 2 shows the similarity ranks produced by human judges (left panel) and VGGFace2 (right panel) at the time of marriage (blue bars) and 20 to 69 years later (green bars). The combined results for all age groups are shown on the gray background. Consistent with the previous studies 28,29,33-35, we found evidence of homogamy, or spouses' tendency to have similar faces. At the time of marriage, their average rank was significantly lower than 3.5 (i.e., the rank expected if the alternatives were ranked randomly): 2.75 (95% CI = [2.69, 2.81], one-sample t-test t = −25.08, two-tailed p < 0.001, n = 517) for human judges; and 2.89 (95% CI = [2.76, 3.02], one-sample t-test t = −9.32, two-tailed p < 0.001, n = 517) for VGGFace2.
However, we did not find evidence for the convergence in physical appearance hypothesis: Spouses' faces did not become more similar with time. In fact, according to human judges, spouses' faces became slightly less similar with time (paired t-test; t = −3.70, two-tailed p < 0.001, n = 517), though the difference in the rankings was relatively small (Δ = 0.15, 95% CI = [0.07, 0.22]) and was not replicated in the VGGFace2 analysis. The same results were obtained when analyzing data separately for couples married for different lengths of time (Fig. 2): Spouses' faces tended to be similar but did not become more similar with time, regardless of the time span between the first and the second set of pictures.
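For readers who want to see how the test statistics used in this study are formed, the sketch below computes a one-sample t statistic against the chance rank and a paired t statistic for two time points. It is a minimal standard-library illustration, not the authors' analysis code (which is available at the OSF link given in the Methods); p-values are omitted because they require the t distribution. Taking the paired difference as early rank minus later rank is an assumption, chosen because it yields a negative t when ranks increase over time.

```python
import math
from statistics import mean, stdev


def one_sample_t(values, mu0):
    """t statistic for H0: the population mean of `values` equals `mu0`."""
    n = len(values)
    standard_error = stdev(values) / math.sqrt(n)
    return (mean(values) - mu0) / standard_error


def paired_t(early, later):
    """Paired t statistic on per-couple differences (early minus later)."""
    if len(early) != len(later):
        raise ValueError("paired samples must have equal length")
    differences = [a - b for a, b in zip(early, later)]
    return one_sample_t(differences, 0.0)
```

The reported values are consistent with this arithmetic: for human judges, a mean rank of 2.75 with t = −25.08 implies a standard error of about 0.75 / 25.08 ≈ 0.030, and 2.75 ± 1.96 × 0.030 gives approximately [2.69, 2.81], the reported 95% CI.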
Importantly, the reliability of the judgments did not vary with subjects' age or the time when the picture was taken: There was no significant difference between the inter-rater reliability for pictures taken at the time of marriage and later. There were also no significant differences in judges' ratings of spouses' likelihood to be married between facial images taken at the time of marriage and later (paired t-test; t = −1.51, two-tailed p = 0.13, n = 517; see Supplementary Table S2 for details).
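The inter-rater reliability measure used in this study, the average Kendall rank correlation between two randomly selected rankings per stimulus set, can be sketched as follows. This is an illustrative implementation of Kendall's tau for rankings without ties; the function names and the input layout (a list of rankings per stimulus set) are assumptions made for this example.

```python
import random
from itertools import combinations
from statistics import mean


def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two tie-free rankings of the same items."""
    pairs = list(combinations(range(len(rank_a)), 2))
    score = 0
    for i, j in pairs:
        product = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        # +1 for a concordant pair, -1 for a discordant pair
        score += (product > 0) - (product < 0)
    return score / len(pairs)


def inter_rater_reliability(rankings_by_set, rng):
    """Average tau over stimulus sets, using two randomly chosen rankings per set."""
    taus = []
    for rankings in rankings_by_set:
        first, second = rng.sample(rankings, 2)
        taus.append(kendall_tau(first, second))
    return mean(taus)
```

Two identical rankings give a tau of 1, fully reversed rankings give −1, and unrelated rankings average around 0, so comparing this value between early and late photographs indicates whether judges agreed equally well at both time points.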

Discussion
We do not find support for the widely disseminated convergence in physical appearance hypothesis: Spouses' faces are similar but do not converge with time. This brings facial appearance in line with other traits (such as interests, personality, intelligence, attitudes, values, and well-being), which show initial similarity but do not converge over time 2.
This study has several limitations. First, we used publicly available images and thus could not control for variance in image properties and self-presentation (such as grooming, facial expression, or biases in selecting images to be publicly shared online). Yet, according to the convergence in physical appearance hypothesis, these factors should amplify the convergence rather than obscure it. Spouses' tendency to occupy the same environments, engage in the same activities, eat the same food, and, in particular, mimic each other's emotional expressions should result in convergence in their self-presentation behaviors, and thus more (and not less) similar public facial images. Second, we did not record or control for judges' age and ethnicity and thus could not assess the extent to which their judgments might have been affected by the own-age 36 and own-ethnicity 37 biases (people's lower sensitivity when judging the similarity of faces of other ages and ethnic groups). Yet, while the own-ethnicity bias could add noise to our measurements, it is unlikely to moderate the change in similarity over time, as participants' ethnicity was constant. Also, while the U.S. AMT workers tend to be young 38, they were as good at ranking the similarity of faces of young people (taken several decades ago) as the faces of older people (taken more recently). Furthermore, these and other risks to the judges' accuracy were counterbalanced by the use of two independent measures of facial similarity (human judges and VGGFace2) and the relatively large sample size, enabling the detection of a change in human rankings as small as Δ = 0.17 (with 80% power, α = 0.001), an equivalent of one in six judges increasing a spouse's rank by just one position. Finally, the validity of our approach and dataset is supported by the successful replication of the well-established effect of people's tendency to marry similar others (i.e., homogamy).
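The minimum detectable change quoted above can be approximately reproduced from figures reported in the Results. The sketch below is a back-of-the-envelope check under stated assumptions, not the authors' power calculation: it uses a normal approximation to the paired t-test and backs out the standard deviation of the per-couple differences from the rounded values Δ = 0.15 and |t| = 3.70.

```python
import math
from statistics import NormalDist


def min_detectable_difference(sd_diff, n, alpha, power):
    """Smallest mean paired difference detectable by a two-tailed test.

    Normal approximation: delta = (z_{1 - alpha/2} + z_{power}) * sd / sqrt(n).
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return (z_alpha + z_power) * sd_diff / math.sqrt(n)


n = 517
# Assumption: sd of differences recovered from t = delta / (sd / sqrt(n)),
# using the rounded values reported in the Results.
sd_diff = 0.15 * math.sqrt(n) / 3.70
delta_min = min_detectable_difference(sd_diff, n, alpha=0.001, power=0.80)
print(round(delta_min, 2))  # 0.17
```

The result agrees with the stated Δ = 0.17 and with the paper's interpretation: if one in six judges moved a spouse by one rank position, the mean rank would shift by 1/6 ≈ 0.17.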
While the rejection of the convergence in physical appearance hypothesis is surely not as exciting or as citeworthy as its counterfactual, it solves one of the major conundrums of psychological science and brings us closer to understanding factors predisposing people to form and maintain long-term romantic relationships.
Received: 2 April 2020; Accepted: 23 September 2020
www.nature.com/scientificreports/
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.